[musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)

mailing list of musl libc

* [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
@ 2022-09-19  7:53 baiyang
  2022-09-19 11:08 ` Szabolcs Nagy
  2022-09-19 13:43 ` Rich Felker
  0 siblings, 2 replies; 68+ messages in thread
From: baiyang @ 2022-09-19  7:53 UTC (permalink / raw)
  To: musl


Hi there,

As we discussed at https://github.com/openwrt/openwrt/issues/10752, the malloc_usable_size() function in musl 1.2 (mallocng) seems to have some performance issues.

It causes realloc and free to spend too long getting the chunk size.

As we mentioned in the discussion, tcmalloc and some other allocators can also accurately obtain the size class corresponding to a memory block and its precise size, and they are also very fast at the same time.

Can we make some improvements to the existing malloc_usable_size algorithm in mallocng? This should significantly improve the performance of existing algorithms.

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19  7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang
@ 2022-09-19 11:08 ` Szabolcs Nagy
  2022-09-19 12:36   ` Florian Weimer
  2022-09-19 13:43 ` Rich Felker
  1 sibling, 1 reply; 68+ messages in thread
From: Szabolcs Nagy @ 2022-09-19 11:08 UTC (permalink / raw)
  To: baiyang; +Cc: musl

* baiyang <baiyang@gmail.com> [2022-09-19 15:53:30 +0800]:

> Hi there,
> 
> As we discussed at https://github.com/openwrt/openwrt/issues/10752, the malloc_usable_size() function in musl 1.2 (mallocng) seems to have some performance issues.
> 
> It causes realloc and free to spend too long getting the chunk size.

how is malloc_usable_size related to free and realloc performance?

> 
> As we mentioned in the discussion, tcmalloc and some other allocators can also accurately obtain the size class corresponding to a memory block and its precise size, and they are also very fast at the same time.

unlike musl those implementations don't return exact size nor have the
same security and memory fragmentation guarantees, so bad comparison.

tcmalloc:
  // Returns the actual number N of bytes reserved by tcmalloc for the pointer
  // p.  This number may be equal to or greater than the number of bytes
  // requested when p was allocated.
  //
  // This function is just useful for statistics collection.  The client must
  // *not* read or write from the extra bytes that are indicated by this call.

jemalloc:
      <para>The <function>malloc_usable_size()</function> function
      returns the usable size of the allocation pointed to by
      <parameter>ptr</parameter>.  The return value may be larger than the size
      that was requested during allocation.  The
      <function>malloc_usable_size()</function> function is not a
      mechanism for in-place <function>realloc()</function>; rather
      it is provided solely as a tool for introspection purposes.  Any
      discrepancy between the requested allocation size and the size reported
      by <function>malloc_usable_size()</function> should not be
      depended on, since such behavior is entirely implementation-dependent.


> 
> Can we make some improvements to the existing malloc_usable_size algorithm in mallocng? This should significantly improve the performance of existing algorithms.

can you give an actual interesting workload where that's
the bottleneck?

given that calling this function is almost always a bug
in production code, it's hard to see how it can cause
performance problems.


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 11:08 ` Szabolcs Nagy
@ 2022-09-19 12:36   ` Florian Weimer
  2022-09-19 13:46     ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: Florian Weimer @ 2022-09-19 12:36 UTC (permalink / raw)
  To: baiyang; +Cc: musl

* Szabolcs Nagy:

> unlike musl those implementations don't return exact size nor have the
> same security and memory fragmentation guarantees, so bad comparison.
>
> tcmalloc:
>   // Returns the actual number N of bytes reserved by tcmalloc for the pointer
>   // p.  This number may be equal to or greater than the number of bytes
>   // requested when p was allocated.
>   //
>   // This function is just useful for statistics collection.  The client must
>   // *not* read or write from the extra bytes that are indicated by this call.
>
> jemalloc:
>       <para>The <function>malloc_usable_size()</function> function
>       returns the usable size of the allocation pointed to by
>       <parameter>ptr</parameter>.  The return value may be larger than the size
>       that was requested during allocation.  The
>       <function>malloc_usable_size()</function> function is not a
>       mechanism for in-place <function>realloc()</function>; rather
>       it is provided solely as a tool for introspection purposes.  Any
>       discrepancy between the requested allocation size and the size reported
>       by <function>malloc_usable_size()</function> should not be
>       depended on, since such behavior is entirely implementation-dependent.

These implementations are buggy or at least mis-documented.  The
interface contract is clearly that for that particular object, the extra
bytes in the allocation are available for reading and writing.  It is
not guaranteed that the allocator will always provide the same number of
extra bytes for the same requested size, but they must be there for the
allocation being examined.  It's even in the name of the function!

Thanks,
Florian



* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19  7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang
  2022-09-19 11:08 ` Szabolcs Nagy
@ 2022-09-19 13:43 ` Rich Felker
  2022-09-19 17:32   ` baiyang
  1 sibling, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 13:43 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote:
> Hi there,
> 
> As we discussed at
> https://github.com/openwrt/openwrt/issues/10752, the
> malloc_usable_size() function in musl 1.2 (mallocng) seems to have
> some performance issues.
> 
> It causes realloc and free to spend too long getting the chunk size.
> 
> As we mentioned in the discussion, tcmalloc and some other
> allocators can also accurately obtain the size class corresponding
> to a memory block and its precise size, and they are also very fast
> at the same time.
> 
> Can we make some improvements to the existing malloc_usable_size
> algorithm in mallocng? This should significantly improve the
> performance of existing algorithms.

Can you please start from a point of identifying the real-world case
in which you're hitting a performance degradation? Made-up tests are
generally not helpful and will almost always lead to focusing on the
wrong problem.

For now I'm going to focus on some things from the linked thread:

> > Considering that realloc itself contains a complete
> > malloc_usable_size (refer to here and here), So actually most
> > (66.7%) of the realloc time is spent doing malloc_usable_size.

In your test that increments the realloc size by one each iteration,
only one in every PAGESIZE calls has any real work to do. The rest do
nothing but set_size after obtaining the metadata on the object
they're acting on. It's completely expected that the runtime of these
will be dominated by obtaining the metadata; this isn't evidence of
anything wrong. And, moreover, it's almost surely a lot more than
66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap
syscalls and page faults accessing the new pages they produce.
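
A rough sketch of the kind of test described above, assuming a loop that
grows one allocation by a single byte per call. It is not the original
benchmark from the linked issue, and `count_realloc_moves` is a
hypothetical helper name. As a side note, 2560 x 4096 = 10,485,760, so
the figures are consistent with about 10 * 2^20 iterations on a system
with 4096-byte pages, but that is an inference, not something stated in
the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Grow one allocation a byte at a time, `steps` times, and count how often
 * realloc handed back a different address. Returns -1 if an allocation
 * fails. Hypothetical helper for illustration only. */
static long count_realloc_moves(size_t steps)
{
    assert(steps > 0);

    char *p = malloc(1);
    if (!p)
        return -1;

    long moves = 0;
    for (size_t n = 2; n <= steps + 1; n++) {
        /* Save the address as an integer before realloc may free it. */
        uintptr_t before = (uintptr_t)p;
        char *q = realloc(p, n);
        if (!q) {
            free(p);
            return -1;
        }
        if ((uintptr_t)q != before)
            moves++;
        p = q;
    }
    free(p);
    return moves;
}
```

Most iterations only update the recorded size, so the per-call cost is
dominated by whatever the allocator does to locate its metadata; the
number of moves depends entirely on the allocator in use.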

> > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac
> > os x (malloc_size), and musl 1.1, even on low-end embedded
> > processors, the consumption of malloc_usable_size per 10 million
> > calls is mostly not more than a few hundred milliseconds.

It looks like mallocng's malloc_usable_size is taking around 150 ns
per call on your system, vs maybe 30-50 for others?
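
For scale: at 150 ns per call, 10 million calls take about 1.5 s, while
30-50 ns per call gives 0.3-0.5 s. A minimal sketch of how such a
per-call figure could be measured, assuming a POSIX system with
`clock_gettime` and a libc that declares `malloc_usable_size` in
`<malloc.h>` (glibc and musl do); the helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>
#include <time.h>

/* Average cost of malloc_usable_size() in nanoseconds, measured over
 * `calls` invocations on a single block of `request` bytes.
 * Returns a negative value if the allocation fails. */
static double usable_size_ns_per_call(size_t request, unsigned long calls)
{
    assert(calls > 0);

    void *p = malloc(request);
    if (!p)
        return -1.0;

    /* The volatile sink keeps the results from being discarded. */
    volatile size_t sink = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < calls; i++)
        sink += malloc_usable_size(p);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(p);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)calls;
}
```

A loop like this repeatedly hits the same block with warm caches, so it
is a made-up microbenchmark in exactly the sense discussed above; it says
little about a real workload.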

> > In addition, this very slow slab size acquisition algorithm also
> > needs to be called every time free (see here). So we believe it
> > should be the main reason for malloc/free and realloc performance
> > degradation in version 1.2.

Unless you have an application that's explicitly using
malloc_usable_size all over the place, it's highly unlikely that this
is the cause of your real-world performance problems. The vast
majority of reported problems with malloc performance have been in
multithreaded applications, where the dominating time cost is
fundamental: synchronization cost of having global consistency. There
you'll expect to find very similar performance figures from any other
allocator with global consistency, such as hardened_malloc.

If you're really having single-threaded performance problems that
aren't just in made-up benchmarks, please see if you can narrow down
the cause empirically rather than speculatively. For example, running
the program under perf and looking at where the time is being spent.

> > If we can improve its speed and make it close to implementations
> > like tcmalloc (tcmalloc can also accurately return the size of the
> > size class to which the chunk belongs), it should significantly
> > improve the performance of mallocng (at least in single-threaded
> > scenarios) .

tcmalloc is fast by not having global consistency, not being hardened
against memory errors like double-free and use-after-free, and not
avoiding fragmentation and excessive memory usage. Likewise for most
of the others. The run time costs in mallocng for looking up the
out-of-band metadata are largely fundamental to it being out-of-band
(not subject to direct falsification via typically exploitable
application bugs), size-efficient, 32-bit-compatible,
nommu-compatible, etc. Other approaches like in hardened_malloc can be
moderately more efficient to access the metadata, at the price of not
being at all amenable to small systems, which are a core goal of musl
we can't really disregard.

I can't say for sure there's not any room for optimization in the
metadata fetching though. Looking at the assembly output might be
informative, to see if we're doing anything that's making the compiler
emit gratuitously inefficient code.

Rich


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 12:36   ` Florian Weimer
@ 2022-09-19 13:46     ` Rich Felker
  2022-09-19 13:53       ` James Y Knight
  2022-09-20  8:33       ` Florian Weimer
  0 siblings, 2 replies; 68+ messages in thread
From: Rich Felker @ 2022-09-19 13:46 UTC (permalink / raw)
  To: Florian Weimer; +Cc: baiyang, musl

On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > unlike musl those implementations don't return exact size nor have the
> > same security and memory fragmentation guarantees, so bad comparison.
> >
> > tcmalloc:
> >   // Returns the actual number N of bytes reserved by tcmalloc for the pointer
> >   // p.  This number may be equal to or greater than the number of bytes
> >   // requested when p was allocated.
> >   //
> >   // This function is just useful for statistics collection.  The client must
> >   // *not* read or write from the extra bytes that are indicated by this call.
> >
> > jemalloc:
> >       <para>The <function>malloc_usable_size()</function> function
> >       returns the usable size of the allocation pointed to by
> >       <parameter>ptr</parameter>.  The return value may be larger than the size
> >       that was requested during allocation.  The
> >       <function>malloc_usable_size()</function> function is not a
> >       mechanism for in-place <function>realloc()</function>; rather
> >       it is provided solely as a tool for introspection purposes.  Any
> >       discrepancy between the requested allocation size and the size reported
> >       by <function>malloc_usable_size()</function> should not be
> >       depended on, since such behavior is entirely implementation-dependent.
> 
> These implementations are buggy or at least mis-documented.  The
> interface contract is clearly that for that particular object, the extra
> bytes in the allocation are available for reading and writing.  It is
> not guaranteed that the allocator will always provide the same number of
> extra bytes for the same requested size, but they must be there for the
> allocation being examined.  It's even in the name of the function!

I'm not sure I understand what you're saying, but the core problem
that really can't be solved is potential discrepancy between the
malloc implementation's idea of usable and the compiler's. For
example:

	char *p = malloc(1);
	if (malloc_usable_size(p)>1) p[1] = 42;

will cause a compiler that's actively detecting UB to abort the
program when malloc_usable_size returns a value larger than 1.
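
One way to sidestep that discrepancy, sketched here under assumptions, is
to ask realloc for the extra bytes so that the object itself (and hence
the compiler's view of it) has the larger size. The helper name
`grow_to_usable` is hypothetical, and realloc is free to move the block.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* Given a block that was requested with `have` bytes, try to turn its
 * slack into part of the object by calling realloc with the usable size.
 * On success returns the (possibly moved) block and stores its new size
 * in *size_out. On failure returns NULL and the old block stays valid. */
static char *grow_to_usable(char *p, size_t have, size_t *size_out)
{
    assert(p != NULL && size_out != NULL);

    size_t usable = malloc_usable_size(p);
    if (usable <= have) {
        /* No slack reported: nothing to claim. */
        *size_out = have;
        return p;
    }

    char *q = realloc(p, usable);
    if (!q)
        return NULL;

    *size_out = usable;
    return q;
}
```

After a successful call, writes up to `*size_out` bytes are within the
size passed to realloc, so they no longer depend on what
`malloc_usable_size` reported.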

Rich


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 13:46     ` Rich Felker
@ 2022-09-19 13:53       ` James Y Knight
  2022-09-19 17:40         ` baiyang
  2022-09-20  8:33       ` Florian Weimer
  1 sibling, 1 reply; 68+ messages in thread
From: James Y Knight @ 2022-09-19 13:53 UTC (permalink / raw)
  To: musl; +Cc: Florian Weimer, baiyang


Indeed. Red Hat mentioned that problem in their recent post about
_FORTIFY_SOURCE=3:
https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level

"""
_FORTIFY_SOURCE=3 revealed another pattern. Applications such as systemd
used malloc_usable_size to determine available space in objects and then
used the residual space. The glibc manual discourages this type of usage,
dictating that malloc_usable_size is for diagnostic purposes only. But
applications use the function as a hack to avoid reallocating buffers when
there is space in the underlying malloc chunk. The implementation of
malloc_usable_size needs to be fixed to return the allocated object size
instead of the chunk size in non-diagnostic use. Alternatively, another
solution is to deprecate the function. But that is a topic for discussion
by the glibc community.
"""


* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 13:43 ` Rich Felker
@ 2022-09-19 17:32   ` baiyang
  2022-09-19 18:15     ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-19 17:32 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


Hi Rich,

Thanks for your reply.

> Unless you have an application that's explicitly using
> malloc_usable_size all over the place, it's highly unlikely that this
> is the cause of your real-world performance problems. 

1. Yes, we have a real scenario where `malloc_usable_size` is called frequently: we need to optimize the realloc experience. We add an extra parameter to realloc, minimalCopyBytes, which is the actual size of the data that must be copied if we fall back to malloc-copy-free mode. We decide whether to call realloc or to do the malloc-memcpy-free ourselves based on three factors: the size of the data that realloc would need to copy (obtained through `malloc_usable_size`), the size that we actually need to copy when doing malloc-memcpy-free ourselves (minimalCopyBytes), and the chance of merging chunks (small blocks) or of mremap (large blocks) in the underlying realloc. So this is a real scenario in which we need to call `malloc_usable_size` frequently.

2. As I mentioned before, this isn't just a problem with `malloc_usable_size`: since both `realloc` and `free` include a full `malloc_usable_size` procedure, it also slows down those calls. So this problem slows down not only the `malloc_usable_size` call itself, but also the realloc and free calls.
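
The wrapper described in point 1 might look roughly like the sketch
below. Everything in it is an assumption made for illustration: the name
`realloc_min_copy`, its signature, and the "less than half the block is
live" threshold. The chunk-merging and mremap considerations mentioned
above are left out entirely.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper. Preconditions: 0 < new_size,
 * minimal_copy_bytes <= new_size, and the first minimal_copy_bytes bytes
 * of p lie within the size that was originally requested for p. */
static void *realloc_min_copy(void *p, size_t new_size,
                              size_t minimal_copy_bytes)
{
    assert(new_size > 0);
    assert(minimal_copy_bytes <= new_size);

    if (!p)
        return malloc(new_size);

    /* Upper bound on what realloc may copy if it has to move the block. */
    size_t old_usable = malloc_usable_size(p);

    /* If the block already has room, or most of it is live data anyway,
     * let realloc handle it. */
    if (new_size <= old_usable || minimal_copy_bytes >= old_usable / 2)
        return realloc(p, new_size);

    /* Otherwise copy only the live prefix ourselves. */
    void *q = malloc(new_size);
    if (!q)
        return NULL; /* p is left untouched, as with a failed realloc */
    memcpy(q, p, minimal_copy_bytes);
    free(p);
    return q;
}
```

Note that this calls `malloc_usable_size` once per reallocation, which is
why its cost shows up directly in this scenario.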

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 13:53       ` James Y Knight
@ 2022-09-19 17:40         ` baiyang
  2022-09-19 18:14           ` Szabolcs Nagy
  2022-09-20  0:13           ` James Y Knight
  0 siblings, 2 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 17:40 UTC (permalink / raw)
  To: James Y Knight, musl; +Cc: Florian Weimer


Hi James,

I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).

On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499

I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc would cause a segmentation fault.

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 17:40         ` baiyang
@ 2022-09-19 18:14           ` Szabolcs Nagy
  2022-09-19 18:40             ` baiyang
                               ` (2 more replies)
  2022-09-20  0:13           ` James Y Knight
  1 sibling, 3 replies; 68+ messages in thread
From: Szabolcs Nagy @ 2022-09-19 18:14 UTC (permalink / raw)
  To: baiyang; +Cc: James Y Knight, musl, Florian Weimer

* baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
> 
> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
> 
> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc would cause a segmentation fault.

obviously internally the implementation can use the internal chunk size...

GetSize(p) is not the exact size (that the user allocated) but an internal
size (which may be larger) and that must not be exposed *outside* of the
malloc implementation (other than for diagnostic purposes).

you can have 2 views:

(1) tcmalloc and jemalloc are buggy because they expose an internal
    size that must not be exposed (because it can break user code).

(2) user code is buggy if it uses malloc_usable_size for any purpose
    other than diagnostic/statistics (because other uses are broken
    on many implementations).

either way the brokenness you want to support is a security hazard
and you are lucky that musl saves the day: it works hard not to
expose internal sizes so the code you seem to care about can operate
safely (which is not true on tcmalloc and jemalloc: the compiler
may break that code).


* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 17:32   ` baiyang
@ 2022-09-19 18:15     ` Rich Felker
  2022-09-19 18:44       ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 18:15 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 01:32:31AM +0800, baiyang wrote:
> Hi Rich,
> 
> Thanks for your reply.
> 
> > Unless you have an application that's explicitly using
> > malloc_usable_size all over the place, it's highly unlikely that this
> > is the cause of your real-world performance problems. 
> 
> 1. Yes, we have a real scenario where `malloc_usable_size` is called
> frequently: we need to optimize the realloc experience. We add an
> extra parameter to realloc - minimalCopyBytes: it represents the
> actual size of data that needs to be copied after fallback to
> malloc-copy-free mode. We will judge whether to call realloc or
> complete malloc-memcpy-free by ourself based on factors such as the
> size of the data that realloc needs to copy (obtained through
> `malloc_usable_size`), the size that we actually need to copy when
> we doing malloc-memcpy-free ourself (minimalCopyBytes) and the
> chance of merging chunks (small blocks) or mremap (large blocks) in
> the underlayer realloc. So, this is a real scenario, we need to call
> `malloc_usable_size` frequently.

Is there a reason you're relying on an unreliable and nonstandard
function (malloc_usable_size) to do this rather than your program
keeping track of its own knowledge of the allocated size? This is what
the C language expects you to do. For example if you have a structure
that contains a pointer to a dynamically sized buffer, normally you
store the size in a size_t member right next to that pointer, allowing
you to make these kind of decisions without having to probe anything.
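
A minimal sketch of that bookkeeping, assuming a hypothetical growable
buffer type (`struct buf`, `buf_reserve` and `buf_append` are illustrative
names, not part of musl or of any code posted in this thread). The capacity
check uses only the size the program itself requested:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer: the size lives right next to the pointer. */
struct buf {
    unsigned char *data;
    size_t len;   /* bytes currently in use */
    size_t cap;   /* bytes we requested from malloc/realloc */
};

/* Make room for at least `need` bytes. Returns 0 on success, -1 on failure. */
static int buf_reserve(struct buf *b, size_t need)
{
    if (need <= b->cap)
        return 0;                      /* decided from our own records */

    size_t new_cap = b->cap ? b->cap : 64;
    while (new_cap < need) {
        if (new_cap > SIZE_MAX / 2) {  /* doubling would overflow */
            new_cap = need;
            break;
        }
        new_cap *= 2;
    }

    unsigned char *p = realloc(b->data, new_cap);
    if (!p)
        return -1;                     /* the old block is still valid */
    b->data = p;
    b->cap = new_cap;
    return 0;
}

/* Append `n` bytes from `src`. Returns 0 on success, -1 on failure. */
static int buf_append(struct buf *b, const void *src, size_t n)
{
    if (n == 0)
        return 0;
    if (n > SIZE_MAX - b->len)
        return -1;
    if (buf_reserve(b, b->len + n) != 0)
        return -1;
    assert(b->len + n <= b->cap);      /* never write past what we requested */
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

Starting from a zero-initialized `struct buf`, every growth decision is
made from `cap`, so no call into the allocator is needed to learn the size.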

> 2. As I mentioned before, this isn't just a problem with
> `malloc_usable_size`, since we actually include a full
> `malloc_usable_size` procedure in both `realloc` and `free`, it
> actually slows down The speed of other calls such as `free` and
> `realloc`. So this problem actually slows down not only the
> `malloc_usable_size` call itself, but also the realloc and free
> calls.

If this is affecting you too, that's a separate issue. But I can't
tell from what you've reported so far whether you're just claiming
this on a theoretical basis or whether you're actually experiencing
unacceptable performance.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 18:14           ` Szabolcs Nagy
@ 2022-09-19 18:40             ` baiyang
  2022-09-19 19:07             ` Gabriel Ravier
  2022-09-19 20:46             ` Nat!
  2 siblings, 0 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 18:40 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: James Y Knight, musl, Florian Weimer

> GetSize(p) is not the exact size (that the user allocated) but an internal size (which may be larger)

Yes, as I mentioned in another email, we just need this "internal size".

Also, I don't think there is any ambiguity in the manual pages of each platform regarding this "internal size": The value returned by malloc_usable_size() may be greater than the requested size of the allocation -- that's exactly what we want.

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 18:15     ` Rich Felker
@ 2022-09-19 18:44       ` baiyang
  2022-09-19 19:18         ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-19 18:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> Is there a reason you're relying on an unreliable and nonstandard
> function (malloc_usable_size) to do this rather than your program
> keeping track of its own knowledge of the allocated size? This is what
> the C language expects you to do. For example if you have a structure
> that contains a pointer to a dynamically sized buffer, normally you
> store the size in a size_t member right next to that pointer, allowing
> you to make these kind of decisions without having to probe anything.

Yes, as I have said: by comparing the number of bytes that realloc needs to copy in the worst case (the return value of malloc_usable_size) with the number of bytes we actually need to copy, we can optimize the performance of realloc in real scenarios and avoid unnecessary memory copies.

In fact, in scenarios including glibc, tcmalloc, windows crt, mac os x, uclibc and musl 1.1, we did achieve good optimization results.

On the other hand, of course we keep track of the number of bytes we allocated, but that number does not objectively reflect how many bytes realloc copies when the memcpy actually occurs. malloc_usable_size() more accurately reflects how many bytes realloc needs to copy when it degenerates to malloc-memcpy-free mode.

So our expectation is as mentioned in the man page for linux, mac os or windows: "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation" or "The memory block size is always at least as large as the allocation it backs, **and may be larger**." - We expect to get its internal size to evaluate the cost of memory copying.

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 18:14           ` Szabolcs Nagy
  2022-09-19 18:40             ` baiyang
@ 2022-09-19 19:07             ` Gabriel Ravier
  2022-09-19 19:21               ` Rich Felker
  2022-09-19 20:46             ` Nat!
  2 siblings, 1 reply; 68+ messages in thread
From: Gabriel Ravier @ 2022-09-19 19:07 UTC (permalink / raw)
  To: baiyang, James Y Knight, musl, Florian Weimer, dalias

On 9/19/22 20:14, Szabolcs Nagy wrote:
> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
>>
>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
>>
>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
> obviously internally the implementation can use the internal chunk size...
>
> GetSize(p) is not the exact size (that the user allocated) but an internal
> size (which may be larger) and that must not be exposed *outside* of the
> malloc implementation (other than for diagnostic purposes).
>
> you can have 2 views:
>
> (1) tcmalloc and jemalloc are buggy because they expose an internal
>      size that must not be exposed (because it can break user code).
>
> (2) user code is buggy if it uses malloc_usable_size for any purpose
>      other than diagnostic/statistics (because other uses are broken
>      on many implementations).
>
> either way the brokenness you want to support is a security hazard
> and you are lucky that musl saves the day: it works hard not to
> expose internal sizes so the code you seem to care about can operate
> safely (which is not true on tcmalloc and jemalloc: the compiler
> may break that code).

While I would agree that using malloc_usable_size is generally not a 
great idea (it's at most acceptable as a small micro-optimization, but I 
would only ever expect it to be seen in very well-tested code in very 
hot loops, as it is indeed quite easily misused), it seems like a bit of 
a stretch to say that all of:

- sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c)

- systemd 
(https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h , 
along with all files using MALLOC_SIZEOF_SAFE, i.e. 
src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, 
src/basic/memory-util.h, src/basic/recurse-dir.c, 
src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, 
src/shared/journal-importer.c, src/shared/varlink.c, 
src/test/test-alloc-util.c and src/test/test-compress.c)

- rocksdb 
(https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc 
, along with at least 20 other uses)

- folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h)

- lzham_codec 
(https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp)

- quickjs 
(https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c)

- redis (https://github.com/redis/redis/blob/unstable/src/networking.c, 
along with a few other uses elsewhere)

along with so many more well-known projects that I've given up on 
listing them, are all buggy because of their usage of malloc_usable_size...


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 18:44       ` baiyang
@ 2022-09-19 19:18         ` Rich Felker
  2022-09-19 19:45           ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 19:18 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 02:44:58AM +0800, baiyang wrote:
> > Is there a reason you're relying on an unreliable and nonstandard
> > function (malloc_usable_size) to do this rather than your program
> > keeping track of its own knowledge of the allocated size? This is what
> > the C language expects you to do. For example if you have a structure
> > that contains a pointer to a dynamically sized buffer, normally you
> > store the size in a size_t member right next to that pointer, allowing
> > you to make these kind of decisions without having to probe anything.
> 
> Yes, as I have said, by comparing the number of bytes that
> realloc needs to copy in the worst case (the return value of
> malloc_usable_size), and the number of bytes we actually need to
> copy, we can optimize the performance of realloc in real scenarios
> and avoid unnecessary memory copies.

You can do exactly the same keeping track of the size yourself. The
only correct value malloc_usable_size can return is the value you
passed to the allocator. If it's returning a different size, your
malloc implementation has a problem that will make you commit UB when
you use the result of malloc_usable_size. Many real-world ones do have
this problem.

> In fact, in scenarios including glibc, tcmalloc, windows crt, mac os
> x, uclibc and musl 1.1, we did achieve good optimization results.
> 
> On the other hand, of course we keep the number of bytes actually
> allocated, but it doesn't really reflect objectively the number of
> bytes to be copied by realloc when the memcpy actually occurs. And
> malloc_usable_size() more accurately reflects how many bytes realloc
> needs to copy when it degenerates back to malloc-memcpy-free mode.

I don't understand your claim here. The size you would store is
exactly the size that realloc would have to copy. If it copies more,
that's just realloc being inefficient, but the difference is not going
to be material anyway.

> So our expectation is as mentioned in the man page for linux, mac os
> or windows: "The value returned by malloc_usable_size() may be
> **greater than** the requested size of the allocation" or "The
> memory block size is always at least as large as the allocation it
> backs, **and may be larger**." - We expect to get its internal size
> to evaluate the cost of memory copying.

It's sounding more and more like you did premature optimization
without measuring any of this, since there is *no way* the possible
amount of excess copying a realloc implementation might make
internally could cost more than an extra external function call to
malloc_usable_size (even if it did nothing but return).

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 19:07             ` Gabriel Ravier
@ 2022-09-19 19:21               ` Rich Felker
  2022-09-19 21:02                 ` Gabriel Ravier
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 19:21 UTC (permalink / raw)
  To: Gabriel Ravier; +Cc: baiyang, James Y Knight, musl, Florian Weimer

On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote:
> On 9/19/22 20:14, Szabolcs Nagy wrote:
> >* baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
> >>I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
> >>
> >>On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
> >>
> >>I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
> >obviously internally the implementation can use the internal chunk size...
> >
> >GetSize(p) is not the exact size (that the user allocated) but an internal
> >size (which may be larger) and that must not be exposed *outside* of the
> >malloc implementation (other than for diagnostic purposes).
> >
> >you can have 2 views:
> >
> >(1) tcmalloc and jemalloc are buggy because they expose an internal
> >     size that must not be exposed (because it can break user code).
> >
> >(2) user code is buggy if it uses malloc_usable_size for any purpose
> >     other than diagnostic/statistics (because other uses are broken
> >     on many implementations).
> >
> >either way the brokenness you want to support is a security hazard
> >and you are lucky that musl saves the day: it works hard not to
> >expose internal sizes so the code you seem to care about can operate
> >safely (which is not true on tcmalloc and jemalloc: the compiler
> >may break that code).
> 
> While I would agree that using malloc_usable_size is generally not a
> great idea (it's at most acceptable as a small micro-optimization,
> but I would only ever expect it to be seen in very well-tested code
> in very hot loops, as it is indeed quite easily misused), it seems
> like a bit of a stretch to say that all of:
> 
> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c)
> 
> - systemd
> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h
> , along with all files using MALLOC_SIZEOF_SAFE, i.e.
> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c,
> src/basic/memory-util.h, src/basic/recurse-dir.c,
> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c,
> src/shared/journal-importer.c, src/shared/varlink.c,
> src/test/test-alloc-util.c and src/test/test-compress.c)
> 
> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc
> , along with at least 20 other uses)
> 
> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h)
> 
> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp)
> 
> - quickjs
> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c)
> 
> - redis
> (https://github.com/redis/redis/blob/unstable/src/networking.c,
> along with a few other uses elsewhere)
> 
> along with so many more well-known projects that I've given up on
> listing them, are all buggy because of their usage of
> malloc_usable_size...

Depending on how you interpret the contract of malloc_usable_size
(which was historically ambiguous), either (1) or (2) above is
*necessarily* true. It's not a matter of opinion, just a logical
consequence of the choice you make.

Moreover, it's not at all a stretch to say 7+ popular projects have
gigantic UB they don't care to fix. The whole story of musl has been
finding *hundreds* of such projects, and eventually getting lots of
them fixed.

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 19:18         ` Rich Felker
@ 2022-09-19 19:45           ` baiyang
  2022-09-19 20:07             ` Rich Felker
  2022-09-19 20:17             ` Joakim Sindholt
  0 siblings, 2 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 19:45 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> The only correct value malloc_usable_size can return is the value you passed to the allocator. 

I don't think so, see:

Linux man page: https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation".

Mac OS X man page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html - "The memory block size is always at least as large as the allocation it backs, **and may be larger**."

FreeBSD man page: https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html - "The return value **may be larger** than the size that was requested during allocation".

These official man pages clearly state that the return value of malloc_usable_size is the size of the memory block allocated internally, not the size submitted by the user. 

Instead, we didn't find any documentation saying that the return value of malloc_usable_size must be the size submitted by the user to be correct. Please correct me if you have the relevant documentation.

> It's sounding more and more like you did premature optimization
> without measuring any of this, since there is *no way* the possible
> amount of excess copying a realloc implementation might make
> internally could cost more than an extra external function call to
> malloc_usable_size (even if it did nothing but return).

As I said before:
> We have a real scenario where `malloc_usable_size` is called frequently: we need to optimize the realloc experience. We add an extra parameter to realloc - minimalCopyBytes: it represents the actual size of data that needs to be copied after fallback to malloc-copy-free mode. We will judge whether to call realloc or complete malloc-memcpy-free by ourself based on factors such as the size of the data that realloc needs to copy (obtained through `malloc_usable_size`), the size that we actually need to copy when we doing malloc-memcpy-free ourself (minimalCopyBytes) and the chance of merging chunks (small blocks) or mremap (large blocks) in the underlayer realloc. So, this is a real scenario, we need to call `malloc_usable_size` frequently.

Example: We allocate a block of 500KB (malloc actually allocated 512KB) and want to extend it to 576KB via realloc. At this point realloc may fall back to the inefficient malloc(576KB), memcpy(512KB), free(512KB) sequence. But in reality we may only need to keep the first 4KB of the 500KB, so we comprehensively evaluate the cost (including the possibility of realloc merging blocks, as in musl 1.1, or using techniques like mremap to avoid copying) to decide whether doing malloc(576KB), memcpy(**4KB**), free(512KB) ourselves is more cost-effective.

Such optimizations have measurable and significant effects on our practical applications in each of the above libc environments.

In this scenario, we need to get the 512KB actually allocated by malloc through malloc_usable_size instead of the 500KB length we saved ourselves.
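
The decision described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: `grow_keep_prefix`, its
parameters and the `threshold` knob are hypothetical names, and the real
heuristic (which also weighs the chance of in-place growth or mremap) is
not shown in this thread. `worst_copy` is whatever estimate the caller has
of the bytes realloc would copy if it had to move the block:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: grow `old` to `new_size` bytes while preserving
 * only its first `keep` bytes.
 *
 * worst_copy: caller's estimate of what realloc copies if it must move.
 * threshold:  assumed tuning knob; below this saving we just use realloc.
 *
 * Returns the new block, or NULL on failure (the old block stays valid).
 */
static void *grow_keep_prefix(void *old, size_t worst_copy, size_t keep,
                              size_t new_size, size_t threshold)
{
    assert(keep <= new_size);          /* caller contract */

    if (!old)
        return malloc(new_size);

    /* Small potential saving: let realloc try to extend in place or remap. */
    if (worst_copy <= keep || worst_copy - keep < threshold)
        return realloc(old, new_size);

    /* Large potential saving: copy only the prefix we care about. */
    void *p = malloc(new_size);
    if (!p)
        return NULL;
    if (keep)
        memcpy(p, old, keep);
    free(old);
    return p;
}
```

With the numbers from the example, `worst_copy` would be 512KB (or the
500KB the program recorded), `keep` 4KB and `new_size` 576KB. Bytes beyond
`keep` are indeterminate on the manual path, so this only fits callers that
really discard the rest.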

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 19:45           ` baiyang
@ 2022-09-19 20:07             ` Rich Felker
  2022-09-19 20:17               ` baiyang
  2022-09-19 20:17             ` Joakim Sindholt
  1 sibling, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 20:07 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 03:45:35AM +0800, baiyang wrote:
> > The only correct value malloc_usable_size can return is the value
> > you passed to the allocator.
> 
> I don't think so, see:
> 
> Linux man page:
> https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html -
> "The value returned by malloc_usable_size() may be **greater than**
> the requested size of the allocation".
> 
> Mac OS X man page:
> https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html
> - "The memory block size is always at least as large as the
> allocation it backs, **and may be larger**."
> 
> FreeBSD man page:
> https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html
> - "The return value **may be larger** than the size that was
> requested during allocation".
> 
> These official man pages clearly state that the return value of
> malloc_usable_size is the size of the memory block allocated
> internally, not the size submitted by the user.
> 
> Instead, we didn't find any documentation saying that the return
> value of malloc_usable_size must be the size submitted by the user
> to be correct. Please correct me if you have the relevant
> documentation.

OK, I didn't state that precisely. There are two conflicting claims
for what the malloc_usable_size contract is. If it's allowed to return
some value larger than the size you requested, then the size returned
is not actually "usable" and there's basically nothing useful you can
do with the function.

> > It's sounding more and more like you did premature optimization
> > without measuring any of this, since there is *no way* the possible
> > amount of excess copying a realloc implementation might make
> > internally could cost more than an extra external function call to
> > malloc_usable_size (even if it did nothing but return).
> 
> As I said before:
> > We have a real scenario where `malloc_usable_size` is called
> > frequently: we need to optimize the realloc experience. We add an
> > extra parameter to realloc - minimalCopyBytes: it represents the
> > actual size of data that needs to be copied after fallback to
> > malloc-copy-free mode. We will judge whether to call realloc or
> > complete malloc-memcpy-free by ourself based on factors such as
> > the size of the data that realloc needs to copy (obtained through
> > `malloc_usable_size`), the size that we actually need to copy when
> > we doing malloc-memcpy-free ourself (minimalCopyBytes) and the
> > chance of merging chunks (small blocks) or mremap (large blocks)
> > in the underlayer realloc. So, this is a real scenario, we need to
> > call `malloc_usable_size` frequently.
> 
> Example: We allocate a block of 500KB (malloc actually allocated
> 512KB) and want to extend it to 576KB via realloc. At this point
> realloc may downgrade back to the inefficient malloc(576KB),
> memcpy(512KB) and free(512KB) modes.

Clearly you did not measure this, because with basically any
real-world malloc, it will call mremap and move the memory via
MMU-level remapping, with no copying involved whatsoever.

> But the real situation at this
> time may be that we only need to keep the first 4KB of content in
> 500KB, so we comprehensively evaluate the cost (including the
> possibility of realloc using block merging like in musl 1.1, and
> techniques like mremap to avoid copying) to decide whether to
> complete malloc(576KB), memcpy(**4KB**), free(512KB) by ourselves
> are more cost-effective.

You could have achieved exactly the same thing by keeping your own
knowledge that you allocated 500kB. But it would still be
significantly slower, because mmap+memcpy+munmap (2 syscalls) is
slower than mremap (1 syscall).
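
For illustration, a Linux-specific sketch of the single-syscall path. It
assumes the block is its own anonymous mapping, as is typical for large
allocations; `map_alloc` and `map_grow` are hypothetical helper names, not
allocator internals:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for mremap and MREMAP_MAYMOVE */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Create an anonymous, zero-filled mapping of `len` bytes (NULL on failure). */
static void *map_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Resize a mapping with one syscall. With MREMAP_MAYMOVE the kernel may
 * relocate the mapping by updating page tables, so the existing contents
 * are carried over without a user-space memcpy.
 * Returns NULL on failure, in which case the old mapping is untouched.
 */
static void *map_grow(void *old, size_t old_len, size_t new_len)
{
    assert(old_len > 0 && new_len > 0);   /* mremap rejects a zero new size */
    void *p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}
```

Growing a 512KB mapping to 576KB this way keeps the old bytes and exposes
zero-filled pages past the old end; the caller releases it with
munmap(ptr, new_len).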

> Such optimizations have measurable and significant effects on our
> practical applications in each of the above libc environments.
> 
> In this scenario, we need to get the 512KB actually allocated by
> malloc through malloc_usable_size instead of the 500KB length we
> saved ourselves.

No you don't. Either number works just as well (or rather just as
poorly).

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 19:45           ` baiyang
  2022-09-19 20:07             ` Rich Felker
@ 2022-09-19 20:17             ` Joakim Sindholt
  2022-09-19 20:33               ` baiyang
  1 sibling, 1 reply; 68+ messages in thread
From: Joakim Sindholt @ 2022-09-19 20:17 UTC (permalink / raw)
  To: musl

On Tue, 20 Sep 2022 03:45:35 +0800, baiyang <baiyang@gmail.com> wrote:
> > The only correct value malloc_usable_size can return is the value you passed to the allocator. 
> 
> I don't think so, see:
> 
> Linux man page: https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation".
> 
> Mac OS X man page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html - "The memory block size is always at least as large as the allocation it backs, **and may be larger**."
> 
> FreeBSD man page: https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html - "The return value **may be larger** than the size that was requested during allocation".
> 
> These official man pages clearly state that the return value of malloc_usable_size is the size of the memory block allocated internally, not the size submitted by the user. 
> 
> Instead, we didn't find any documentation saying that the return value of malloc_usable_size must be the size submitted by the user to be correct. Please correct me if you have the relevant documentation.

It's not that malloc_usable_size must return the size originally
submitted by the user, but that if it doesn't, and you take that as an
invitation to exceed the original size allocated, you will hit UB.

Simple case:
https://gcc.godbolt.org/z/5E65rr95W
Real world example:
https://github.com/systemd/systemd/issues/22801

And the reason why is pretty simple:
http://port70.net/~nsz/c/c11/n1570.html#7.22.3.4p2
> The malloc function allocates space for an object whose size is
> specified by size and whose value is indeterminate.
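
A minimal sketch of one way to stay within that wording while still
using the slack: pass the reported size back through realloc, so that
the object's size changes through an interface the compiler knows
about. The helper name `claim_usable` is hypothetical, and `<malloc.h>`
is assumed to be the glibc/musl header that declares
malloc_usable_size (it is not ISO C). Where the reported size equals
the tracked size, the helper does nothing.

```c
#include <assert.h>
#include <malloc.h>   /* malloc_usable_size: libc extension, not ISO C */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. *size is the size the caller currently tracks
   for p. If the allocator reports more usable space, ask realloc for
   exactly that much, so the larger size becomes the object's real
   size. Returns the (possibly moved) pointer, or NULL if realloc
   failed, in which case p is still valid with its old size. */
static void *claim_usable(void *p, size_t *size)
{
    assert(p != NULL && size != NULL);

    size_t usable = malloc_usable_size(p);
    if (usable <= *size)
        return p;              /* nothing extra to claim */

    void *q = realloc(p, usable);
    if (!q)
        return NULL;
    *size = usable;
    return q;
}
```

After a successful call the caller may write up to the updated `*size`
bytes; writing past the originally requested size without such a call
is the pattern the links above show going wrong.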

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:07             ` Rich Felker
@ 2022-09-19 20:17               ` baiyang
  2022-09-19 20:28                 ` Rich Felker
  2022-09-19 22:02                 ` Quentin Rameau
  0 siblings, 2 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 20:17 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> Clearly you did not measure this, because with basically any real-world malloc, it will call mremap and move the memory via MMU-level remapping, with no copying involved whatsoever. 

OK, I've said this multiple times:
> ...and the chance of merging chunks (small blocks) or **mremap** (large blocks) in the underlayer realloc.
> ...evaluate the cost (including the possibility of realloc using block merging like in musl 1.1, and techniques like **mremap** to avoid copying)

So the possibility of each memory allocator calling mremap in different situations was specifically considered on the Linux platform.

And I mentioned it before: we did massively optimize performance in real-world applications. These are not the focus of our discussion.

So now it is certain that in musl mallocng:
1. malloc_usable_size will always just return the size submitted by the user to malloc, not the actual size allocated inside it, right?
2. We have no plans to improve malloc_usable_size performance yet, right?

thanks

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:17               ` baiyang
@ 2022-09-19 20:28                 ` Rich Felker
  2022-09-19 20:38                   ` baiyang
  2022-09-19 22:02                 ` Quentin Rameau
  1 sibling, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 20:28 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 04:17:20AM +0800, baiyang wrote:
> > Clearly you did not measure this, because with basically any
> > real-world malloc, it will call mremap and move the memory via
> > MMU-level remapping, with no copying involved whatsoever.
> 
> OK, I've said this multiple times:
> > ...and the chance of merging chunks (small blocks) or **mremap**
> > (large blocks) in the underlayer realloc.

To my knowledge there is no allocator which does chunk merging on 500k
chunks in the default configuration.

> > ...evaluate the cost (including the possibility of realloc using
> > block merging like in musl 1.1, and techniques like **mremap** to
> > avoid copying)

musl 1.1.x never did block merging on objects of this size.

> Therefore, we determined that the possibility of each memory
> allocator calling mremap in different situations was specifically
> considered on the LINUX platform.
> 
> And I mentioned it before: we did massively optimize performance in
> real-world applications. These are not the focus of our discussion.
> 
> So now it is certain that in musl mallocng:
> 1. malloc_usable_size will always just return the size submitted by
> the user to malloc, not the actual size allocated inside it, right?

Yes.

> 2. We have no plans to improve malloc_usable_size performance yet, right?

In the absence of a concrete, quantitative report of how it's
impacting performance, definitely no.

Even if there is such a report, if the source of the problem is that
you're gratuitously trying to second-guess realloc and making your
program slower, rather than just calling realloc, it's not a very
interesting report. malloc_usable_size is *not* a function whose use
is recommended, and it's only there at all because we can't remove an
interface once it has been added.

If the problem is actually the size determination in realloc and free,
and if you've *measured* that and can make a quantitative report on
how it affected *real world usage* not a made-up benchmark, then that
is potentially actionable.

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:17             ` Joakim Sindholt
@ 2022-09-19 20:33               ` baiyang
  0 siblings, 0 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 20:33 UTC (permalink / raw)
  To: musl

> you will hit UB
Thanks for the information, but:
1. As we have discussed in other emails, we do not use malloc_usable_size in that way.
2. This is most likely a problem with gcc's corresponding checking mechanism rather than with using glibc's malloc_usable_size() in this way, see: https://gcc.godbolt.org/z/qhqheTqcz
 
--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:28                 ` Rich Felker
@ 2022-09-19 20:38                   ` baiyang
  0 siblings, 0 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 20:38 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> To my knowledge there is no allocator which does chunk merging on 500k chunks in the default configuration.
It can be ANY size, "500KB" is just an example.

> musl 1.1.x never did block merging on objects of this size.
Yes, we know that, and we will evaluate according to the specific situation of the operating system, libc implementation and allocation size. Again, this is not the point of this discussion.

> In the absence of a concrete, quantitative report of how it's impacting performance, definitely no.
OK, understood. Thanks for your time.
 
--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 18:14           ` Szabolcs Nagy
  2022-09-19 18:40             ` baiyang
  2022-09-19 19:07             ` Gabriel Ravier
@ 2022-09-19 20:46             ` Nat!
  2022-09-20  8:51               ` Szabolcs Nagy
  2 siblings, 1 reply; 68+ messages in thread
From: Nat! @ 2022-09-19 20:46 UTC (permalink / raw)
  To: musl



On 19.09.22 20:14, Szabolcs Nagy wrote:
> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
>>
>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
>>
>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
> obviously internally the implementation can use the internal chunk size...
>
> GetSize(p) is not the exact size (that the user allocated) but an internal
> size (which may be larger) and that must not be exposed *outside* of the
> malloc implementation (other than for diagnostic purposes).
>
> you can have 2 views:
>
> (1) tcmalloc and jemalloc are buggy because they expose an internal
>      size that must not be exposed (because it can break user code).
>
> (2) user code is buggy if it uses malloc_usable_size for any purpose
>      other than diagnostic/statistics (because other uses are broken
>      on many implementations).
>
> either way the brokenness you want to support is a security hazard
> and you are lucky that musl saves the day: it works hard not to
> expose internal sizes so the code you seem to care about can operate
> safely (which is not true on tcmalloc and jemalloc: the compiler
> may break that code).
>
You can also have a third view: that malloc allocates "at least" the
amount requested (as it technically is likely to do), and that you can
use "malloc_usable_size" to get the actually available size. On this
view, the code enforcing the semantics that only the "at least" bytes
may be accessed is in error, unless the error-checking code also
modifies "malloc_usable_size" to return only the size requested by
the user.

Surely not a popular opinion :D :D

Ciao
    Nat!




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 19:21               ` Rich Felker
@ 2022-09-19 21:02                 ` Gabriel Ravier
  2022-09-19 21:47                   ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: Gabriel Ravier @ 2022-09-19 21:02 UTC (permalink / raw)
  To: Rich Felker; +Cc: baiyang, James Y Knight, musl, Florian Weimer


On 9/19/22 21:21, Rich Felker wrote:
> On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote:
>> On 9/19/22 20:14, Szabolcs Nagy wrote:
>>> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
>>>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
>>>>
>>>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
>>>>
>>>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
>>> obviously internally the implementation can use the internal chunk size...
>>>
>>> GetSize(p) is not the exact size (that the user allocated) but an internal
>>> size (which may be larger) and that must not be exposed *outside* of the
>>> malloc implementation (other than for diagnostic purposes).
>>>
>>> you can have 2 views:
>>>
>>> (1) tcmalloc and jemalloc are buggy because they expose an internal
>>>      size that must not be exposed (because it can break user code).
>>>
>>> (2) user code is buggy if it uses malloc_usable_size for any purpose
>>>      other than diagnostic/statistics (because other uses are broken
>>>      on many implementations).
>>>
>>> either way the brokenness you want to support is a security hazard
>>> and you are lucky that musl saves the day: it works hard not to
>>> expose internal sizes so the code you seem to care about can operate
>>> safely (which is not true on tcmalloc and jemalloc: the compiler
>>> may break that code).
>> While I would agree that using malloc_usable_size is generally not a
>> great idea (it's at most acceptable as a small micro-optimization,
>> but I would only ever expect it to be seen in very well-tested code
>> in very hot loops, as it is indeed quite easily misused), it seems
>> like a bit of a stretch to say that all of:
>>
>> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c)
>>
>> - systemd
>> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h
>> , along with all files using MALLOC_SIZEOF_SAFE, i.e.
>> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c,
>> src/basic/memory-util.h, src/basic/recurse-dir.c,
>> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c,
>> src/shared/journal-importer.c, src/shared/varlink.c,
>> src/test/test-alloc-util.c and src/test/test-compress.c)
>>
>> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc
>> , along with at least 20 other uses)
>>
>> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h)
>>
>> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp)
>>
>> - quickjs
>> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c)
>>
>> - redis
>> (https://github.com/redis/redis/blob/unstable/src/networking.c,
>> along with a few other uses elsewhere)
>>
>> along with so many more well-known projects that I've given up on
>> listing them, are all buggy because of their usage of
>> malloc_usable_size...
> Depending on how you interpret the contract of malloc_usable_size
> (which was historically ambiguous), either (1) or (2) above is
> *necessarily* true. It's not a matter of opinion, just a logical
> consequence of the choice you make.

Hmmmm... to me neither of the two options you have given is true. Instead,
I'd argue that, as seen by the C abstract machine, invocation of 
malloc_usable_size actually increases the size of a memory block, in 
place, as if by magic. This may seem stupid to some degree, but to me it 
is the only way one can make sense of the documentation for 
malloc_usable_size in glibc, as it has been present in glibc since 
malloc_usable_size was first added in 1996:

     This routine tells you how many bytes you can actually use in an
     allocated chunk, which may be more than you requested (although
     often not). You can use this many bytes without worrying about
     overwriting other allocated objects. Not a particularly great
     programming practice, but still sometimes useful.

To me, this note directly indicates that the function is intended to 
work, within the context of the abstract machine, as if it somehow 
increased the size of the block at the given address. It could perhaps 
have been better named as something like malloc_try_extend_in_place 
w.r.t. interpretation of the function's behavior as relating to the 
abstract machine, but the name it was given probably made more sense at 
the time, as one may have assumed that the other name would have meant 
that it might do some more work than just returning the internal
allocated size.

This interpretation is also the one followed by pretty much every single 
program out there that uses malloc_usable_size, and the opposite 
interpretation (that malloc_usable_size is allowed to return a 
completely useless number that can't be used for anything useful) seems 
rather asinine to me.


Such effects on the C abstract machine may seem ludicrous, as though 
nothing in C should be able to do things like this, but when you search 
for other examples of extensions to C that do weird things with the 
abstract machine, you can find some quite quickly:

- POSIX makes it so that storing into a function pointer by aliasing 
into it with a data pointer shall result in defined behavior

- GNU attributes like __attribute__((may_alias)) make it so that access 
through pointers with types with that attribute are defined even if they 
try to alias normally incompatible object types

- x86 intrinsics that access pointers are defined as having defined 
behavior even if they try to alias normally incompatible object types 
(the may_alias attribute above is used in GCC to implement them)

- POSIX makes it so that calling signal and then making the program 
multi-threaded shall result in defined behavior

- POSIX makes it so that calling fprintf with numbered argument 
conversion specifications (%n$), the apostrophe flag character (along 
with a few other things I'm too lazy to list) shall result in defined 
behavior

- Many more...

and yet all of these are quite often used and it is generally accepted 
that an implementation implementing such features should also make it so 
that the rules of the abstract machine, as implemented, are correctly 
affected and that the implementation does not suddenly make it so that 
using these result in UB where they claim that they implement the 
corresponding features.


You might say that you accept the changes POSIX makes to the abstract 
machine, but not those made by malloc_usable_size, but I would say that 
to define that function is to implicitly accept the semantics of it as 
defined by the implementation you're trying to follow: otherwise, it 
seems to me as though you're implementing the function wrongly, 
intentionally violating the specification of the function as originally 
written.


And yes, I would also argue that this does indeed make it so that 
_FORTIFY_SOURCE=3 as currently implemented is wrong (as shown in 
https://www.openwall.com/lists/musl/2022/09/19/23), instead of 
malloc_usable_size being a useless function.

Also, the previously cited e-mail ended up leading me to a glibc 
bugzilla issue, in which it looks as though glibc developers have 
explicitly stated they consider the value returned by malloc_usable_size 
to be validly usable as the object size corresponding to the pointer: 
https://sourceware.org/bugzilla/show_bug.cgi?id=24775#c6


>
> Moreover, it's not at all a stretch to say 7+ popular projects have
> gigantic UB they don't care to fix. The whole story of musl has been
> finding *hundreds* of such projects, and eventually getting lots of
> them fixed.
>
> Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 21:02                 ` Gabriel Ravier
@ 2022-09-19 21:47                   ` Rich Felker
  2022-09-19 22:31                     ` Gabriel Ravier
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-19 21:47 UTC (permalink / raw)
  To: Gabriel Ravier; +Cc: baiyang, James Y Knight, musl, Florian Weimer

On Mon, Sep 19, 2022 at 11:02:16PM +0200, Gabriel Ravier wrote:
> 
> On 9/19/22 21:21, Rich Felker wrote:
> >On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote:
> >>On 9/19/22 20:14, Szabolcs Nagy wrote:
> >>>* baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
> >>>>I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
> >>>>
> >>>>On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
> >>>>
> >>>>I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
> >>>obviously internally the implementation can use the internal chunk size...
> >>>
> >>>GetSize(p) is not the exact size (that the user allocated) but an internal
> >>>size (which may be larger) and that must not be exposed *outside* of the
> >>>malloc implementation (other than for diagnostic purposes).
> >>>
> >>>you can have 2 views:
> >>>
> >>>(1) tcmalloc and jemalloc are buggy because they expose an internal
> >>>     size that must not be exposed (because it can break user code).
> >>>
> >>>(2) user code is buggy if it uses malloc_usable_size for any purpose
> >>>     other than diagnostic/statistics (because other uses are broken
> >>>     on many implementations).
> >>>
> >>>either way the brokenness you want to support is a security hazard
> >>>and you are lucky that musl saves the day: it works hard not to
> >>>expose internal sizes so the code you seem to care about can operate
> >>>safely (which is not true on tcmalloc and jemalloc: the compiler
> >>>may break that code).
> >>While I would agree that using malloc_usable_size is generally not a
> >>great idea (it's at most acceptable as a small micro-optimization,
> >>but I would only ever expect it to be seen in very well-tested code
> >>in very hot loops, as it is indeed quite easily misused), it seems
> >>like a bit of a stretch to say that all of:
> >>
> >>- sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c)
> >>
> >>- systemd
> >>(https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h
> >>, along with all files using MALLOC_SIZEOF_SAFE, i.e.
> >>src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c,
> >>src/basic/memory-util.h, src/basic/recurse-dir.c,
> >>src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c,
> >>src/shared/journal-importer.c, src/shared/varlink.c,
> >>src/test/test-alloc-util.c and src/test/test-compress.c)
> >>
> >>- rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc
> >>, along with at least 20 other uses)
> >>
> >>- folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h)
> >>
> >>- lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp)
> >>
> >>- quickjs
> >>(https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c)
> >>
> >>- redis
> >>(https://github.com/redis/redis/blob/unstable/src/networking.c,
> >>along with a few other uses elsewhere)
> >>
> >>along with so many more well-known projects that I've given up on
> >>listing them, are all buggy because of their usage of
> >>malloc_usable_size...
> >Depending on how you interpret the contract of malloc_usable_size
> >(which was historically ambiguous), either (1) or (2) above is
> >*necessarily* true. It's not a matter of opinion just logical
> >consequences of the choice you make.
> 
> Hmmmm... to me none of the two options you have given are true.
> Instead, I'd argue that, as seen by the C abstract machine,
> invocation of malloc_usable_size actually increases the size of a
> memory block, in place, as if by magic. This may seem stupid to some
> degree, but to me it is the only way one can make sense of the
> documentation for malloc_usable_size in glibc, as it has been
> present in glibc since malloc_usable_size was first added in 1996:

I don't think that's stupid; it's a very reasonable model.
Unfortunately it's one that's very hard to make the compiler aware of,
and if the compiler can't be aware of the possibility of the object
changing size, then it loses the ability to protect you from memory
errors without having this kind of "false positive".

Note that you can't just reason that "malloc_usable_size could
internally have called realloc" because the compiler can see that the
caller is still using the "old pointer", which is invalid, *even if
realloc returned a bitwise-identical pointer* (even just comparing to
see if it's equal is UB).

I don't know how to make any of this non-awful. Basically,
malloc_usable_size ties your hands with lots of unexpected
constraints. This function alone is almost solely responsible for the
musl inclusion/exclusion criteria policy for nonstandard extensions
being extremely cautious about adding dubious extensions even if they
"look harmless" at first -- we made the mistake of including it long
ago, and don't want to repeat that.

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:17               ` baiyang
  2022-09-19 20:28                 ` Rich Felker
@ 2022-09-19 22:02                 ` Quentin Rameau
  1 sibling, 0 replies; 68+ messages in thread
From: Quentin Rameau @ 2022-09-19 22:02 UTC (permalink / raw)
  To: musl

Hi BaiYang,

> And I mentioned it before: we did massively optimize performance in real-world applications. These are not the focus of our discussion.

And now you're hitting a problem because those micro-non-standard optimizations that worked in some setup
can't be expected to work in a standard way.

This is not surprising, and is quite usual with that kind of programming orientation.

You may know that musl paradigms are quite opposite to that,
one being focused on correctness rather than being bug-compatible with other implementations.

As Rich said, there always might be place for improvement in code,
but that should be proven to work in the majority of cases,
and not at the expense of stability or correctness.

I assume that if you can come up with such an improvement complying with those requirements,
it would be honestly considered,
but the present problem report has been answered several times,
and it seems that the solution to your application's problems lies in your application's own optimizations.

Cheers!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 21:47                   ` Rich Felker
@ 2022-09-19 22:31                     ` Gabriel Ravier
  2022-09-19 22:46                       ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Gabriel Ravier @ 2022-09-19 22:31 UTC (permalink / raw)
  To: Rich Felker; +Cc: baiyang, James Y Knight, musl, Florian Weimer

On 9/19/22 23:47, Rich Felker wrote:
> On Mon, Sep 19, 2022 at 11:02:16PM +0200, Gabriel Ravier wrote:
>> On 9/19/22 21:21, Rich Felker wrote:
>>> On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote:
>>>> On 9/19/22 20:14, Szabolcs Nagy wrote:
>>>>> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
>>>>>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
>>>>>>
>>>>>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
>>>>>>
>>>>>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault.
>>>>> obviously internally the implementation can use the internal chunk size...
>>>>>
>>>>> GetSize(p) is not the exact size (that the user allocated) but an internal
>>>>> size (which may be larger) and that must not be exposed *outside* of the
>>>>> malloc implementation (other than for diagnostic purposes).
>>>>>
>>>>> you can have 2 views:
>>>>>
>>>>> (1) tcmalloc and jemalloc are buggy because they expose an internal
>>>>>      that must not be exposed (because it can break user code).
>>>>>
>>>>> (2) user code is buggy if it uses malloc_usable_size for any purpose
>>>>>      other than diagnostic/statistics (because other uses are broken
>>>>>      on many implementations).
>>>>>
>>>>> either way the brokenness you want to support is a security hazard
>>>>> and you are lucky that musl saves the day: it works hard not to
>>>>> expose internal sizes so the code you seem to care about can operate
>>>>> safely (which is not true on tcmalloc and jemalloc: the compiler
>>>>> may break that code).
>>>> While I would agree that using malloc_usable_size is generally not a
>>>> great idea (it's at most acceptable as a small micro-optimization,
>>>> but I would only ever expect it to be seen in very well-tested code
>>>> in very hot loops, as it is indeed quite easily misused), it seems
>>>> like a bit of a stretch to say that all of:
>>>>
>>>> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c)
>>>>
>>>> - systemd
>>>> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h
>>>> , along with all files using MALLOC_SIZEOF_SAFE, i.e.
>>>> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c,
>>>> src/basic/memory-util.h, src/basic/recurse-dir.c,
>>>> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c,
>>>> src/shared/journal-importer.c, src/shared/varlink.c,
>>>> src/test/test-alloc-util.c and src/test/test-compress.c)
>>>>
>>>> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc
>>>> , along with at least 20 other uses)
>>>>
>>>> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h)
>>>>
>>>> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp)
>>>>
>>>> - quickjs
>>>> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c)
>>>>
>>>> - redis
>>>> (https://github.com/redis/redis/blob/unstable/src/networking.c,
>>>> along with a few other uses elsewhere)
>>>>
>>>> along with so many more well-known projects that I've given up on
>>>> listing them, are all buggy because of their usage of
>>>> malloc_usable_size...
>>> Depending on how you interpret the contract of malloc_usable_size
>>> (which was historically ambiguous), either (1) or (2) above is
>>> *necessarily* true. It's not a matter of opinion just logical
>>> consequences of the choice you make.
>> Hmmmm... to me none of the two options you have given are true.
>> Instead, I'd argue that, as seen by the C abstract machine,
>> invocation of malloc_usable_size actually increases the size of a
>> memory block, in place, as if by magic. This may seem stupid to some
>> degree, but to me it is the only way one can make sense of the
>> documentation for malloc_usable_size in glibc, as it has been
>> present in glibc since malloc_usable_size was first added in 1996:
> I don't think that's stupid; it's a very reasonable model.
> Unfortunately it's one that's very hard to make the compiler aware of,
> and if the compiler can't be aware of the possibility of the object
> changing size, then it loses the ability to protect you from memory
> errors without having this kind of "false positive".
>
> Note that you can't just reason that "malloc_usable_size could
> internally have called realloc" because the compiler can see that the
> caller is still using the "old pointer", which is invalid, *even if
> realloc returned a bitwise-identical pointer* (even just comparing to
> see if it's equal is UB).
>
> I don't know how to make any of this non-awful. Basically,
> malloc_usable_size ties your hands with lots of unexpected
> constraints. This function alone is almost solely responsible for the
> musl inclusion/exclusion criteria policy for nonstandard extensions
> being extremely cautious about adding dubious extensions even if they
> "look harmless" at first -- we made the mistake of including it long
> ago, and don't want to repeat that.
>
> Rich

Well, you've said earlier that your implementation keeps in memory the 
original requested amount of memory and just returns that, and that 
implementation seems perfectly fine: it might be going slightly against 
the spirit of the original function, but as you've pointed out, it's a 
nice way of avoiding the awfulness that malloc_usable_size results in 
w.r.t. the work the compiler has to do to have it work properly while 
not ruining object size instrumentation/optimizations (although I 
suppose it might lead to criticisms of it taking up valuable heap 
size/making programs slow by neutralising APIs used to optimize them, as 
seen in this thread...)


My argumentation was entirely against stating that relying on 
malloc_usable_size to indicate the size of an object is UB: to me, libcs 
that can't return a valid result for it should not pretend they support 
it properly, and compilers that make optimizations that make it not work 
when it returns a higher size than the originally asked-for size should 
not pretend they support it properly either. GCC with _FORTIFY_SOURCE 
coupled with glibc seems to expose this issue, and they basically need 
to either:

- have GCC properly take malloc_usable_size into account w.r.t. their 
__builtin_dynamic_object_size, which might just be a complete nightmare 
to implement...

- have glibc warn upon its usage with _FORTIFY_SOURCE, which might end 
up with people complaining about _FORTIFY_SOURCE breaking their programs 
and disabling useful APIs...

- start the long process of deprecating the function out of glibc 
entirely, which might end up with people complaining about glibc making 
their programs less fast by disabling APIs they use to do so...

- or change their heap implementation so they can return the originally 
requested size, which might end up with people complaining about glibc 
making their programs take more heap space and slower by returning a 
"useless" value out of malloc_usable_size (w.r.t. speed optimisations)...

Quite the potentially awful choice to make indeed...
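
The mismatch behind these options can be sketched in a few lines of C. This is only an illustration, assuming GCC or Clang: `__builtin_object_size` is the compiler builtin that fortification is built on, and `compiler_view_of_ten_bytes` is a hypothetical helper name. The compiler derives the object size from the malloc call itself, so it has no way to learn that the allocator would report something larger at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: report what the compiler believes the size of a
 * 10-byte malloc'd object is. With optimization enabled this is
 * typically 10; without it the builtin may answer "unknown", which is
 * encoded as (size_t)-1. Either way the answer comes from the malloc
 * call, never from the allocator's internal bookkeeping. */
size_t compiler_view_of_ten_bytes(void)
{
    char *p = malloc(10);
    assert(p != NULL); /* sketch only: treat allocation failure as fatal */

    size_t seen = __builtin_object_size(p, 0);

    free(p);
    return seen;
}
```

Any fortified function that receives `p` is checked against this compile-time view, which is why writing up to a larger malloc_usable_size result can trip the check.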



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 22:31                     ` Gabriel Ravier
@ 2022-09-19 22:46                       ` baiyang
  0 siblings, 0 replies; 68+ messages in thread
From: baiyang @ 2022-09-19 22:46 UTC (permalink / raw)
  To: Gabriel Ravier, Rich Felker; +Cc: James Y Knight, musl, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 8664 bytes --]

> My argumentation was entirely against stating that relying on malloc_usable_size to indicate the size of an object is UB 

I totally agree with this point. In all known cases, the function where the error occurs is __builtin_dynamic_object_size, not malloc_usable_size.

As you referenced earlier in the bugzilla discussion (https://sourceware.org/bugzilla/show_bug.cgi?id=24775#c6): This is a bug in Msan. NOT in malloc_usable_size.

So it seems to me an odd choice to add significant overhead to calls like `malloc_usable_size`, `realloc`, and `free` in order to "fix" this non-existent bug.
 
--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

[-- Attachment #2: Type: text/html, Size: 13839 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 17:40         ` baiyang
  2022-09-19 18:14           ` Szabolcs Nagy
@ 2022-09-20  0:13           ` James Y Knight
  2022-09-20  0:25             ` baiyang
  1 sibling, 1 reply; 68+ messages in thread
From: James Y Knight @ 2022-09-20  0:13 UTC (permalink / raw)
  To: baiyang; +Cc: musl, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]

On Mon, Sep 19, 2022 at 1:40 PM baiyang <baiyang@gmail.com> wrote:

> Hi James,
>
> I looked at the code of tcmalloc, but I didn't find any of the problems
> you mentioned in the implementation of malloc_usable_size (see:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099
>  ).
>
> On the contrary, similar to musl, tcmalloc also directly uses the return
> value of malloc_usable_size in its realloc implementation to determine
> whether memory needs to be reallocated:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
>
> I think this is enough to show that the return value of malloc_usable_size
> in tcmalloc is accurate and reliable, otherwise its own realloc will
> cause a segment fault.
>

It returns an accurate value, but it is undefined behavior to actually
_use_ that memory. You need to first call realloc -- and when calling
realloc, you (probably...) need to access the new memory via the return
value of the realloc call, rather than the pointer you passed as input --
even if they happen to have the same integer representation. (Though
there's certainly plenty of debate/discussion regarding the exact workings
of, and advisability of, such "pointer provenance" rules...). In any case,
calling the function inside the malloc implementation to implement realloc
is a different matter entirely from user code calling it.

And, note that just because something works for you today doesn't mean it's
correct or allowed. It just means that you've been able to get away with
it. So far. Until one day it stops working. That's the really fun part of
Undefined Behavior...
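
A minimal sketch of the "call realloc first" pattern described above, assuming a libc that provides malloc_usable_size in <malloc.h> (glibc and musl do). The helper name `claim_usable` is hypothetical and not taken from any of the projects mentioned in this thread.

```c
#include <assert.h>
#include <malloc.h>   /* malloc_usable_size: nonstandard extension */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: formally resize the object to whatever
 * malloc_usable_size reports, and return the pointer that realloc
 * hands back. On success *cap holds the new object size and the old
 * pointer must not be used again, even if it compares equal. On
 * failure NULL is returned and p is still valid at its original size. */
void *claim_usable(void *p, size_t *cap)
{
    assert(p != NULL && cap != NULL);

    size_t n = malloc_usable_size(p);
    if (n == 0)
        return NULL;          /* nothing to claim; avoid realloc(p, 0) */

    void *q = realloc(p, n);
    if (q == NULL)
        return NULL;

    *cap = n;
    return q;
}
```

The caller then reads and writes only through the returned pointer, so the compiler sees an object whose size really is `*cap`.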

[-- Attachment #2: Type: text/html, Size: 3338 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  0:13           ` James Y Knight
@ 2022-09-20  0:25             ` baiyang
  2022-09-20  0:38               ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  0:25 UTC (permalink / raw)
  To: James Y Knight; +Cc: musl, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]

As Gabriel said, it's not UB. On the contrary, it is a bug in gcc's Msan.
Just read more on this thread :-) 

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 


[-- Attachment #2: Type: text/html, Size: 5449 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  0:25             ` baiyang
@ 2022-09-20  0:38               ` Rich Felker
  2022-09-20  0:47                 ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20  0:38 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 08:25:52AM +0800, baiyang wrote:
> As Gabriel said, it's not UB. On the contrary, it is a bug in gcc's Msan.
> Just read more on this thread :-) 

Would it be possible to limit use of the list to actually requesting
help or making reports, rather than inciting debates about what is UB
or what the consequences of UB might be?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  0:38               ` Rich Felker
@ 2022-09-20  0:47                 ` baiyang
  2022-09-20  1:00                   ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  0:47 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1159 bytes --]

> Would it be possible to limit use of the list to actually requesting
> help or making reports, rather than inciting debates about what is UB
> or what the consequences of UB might be?

You are right. 

The real question is: if we only need malloc_usable_size to return the size actually allocated internally (not the size requested by the user, **just as musl version 1.1 and all other libc implementations do**), is it possible to improve its time and space efficiency?

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

[-- Attachment #2: Type: text/html, Size: 3184 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  0:47                 ` baiyang
@ 2022-09-20  1:00                   ` Rich Felker
  2022-09-20  1:18                     ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20  1:00 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 08:47:07AM +0800, baiyang wrote:
> > Would it be possible to limit use of the list to actually requesting
> > help or making reports, rather than inciting debates about what is UB
> > or what the consequences of UB might be?
> 
> You are right. 
> 
> The real question is: if we only need malloc_usable_size to return
> the size actually allocated internally (not the size requested by
> the user, **just as musl version 1.1 and all other libc
> implementations do**), is it possible to improve its time and space
> efficiency?

There is no hidden "size actually allocated internally". The size you
get is the size you requested. Everything else is allocator data
structures *outside of the object* that the caller has no entitlement
to peek or poke at, and malloc_usable_size's return value reflects
that.

If you want to see what portion of the time is being spent on
different parts of processing the metadata, you could sit down and
actually run it under perf to get a profiling report/flame graph. I'm
pretty sure you'll find that the final get_nominal_size step is a
small portion of the time spent. get_meta is probably the majority of
the time, some of it fundamental, and some of it hardening. But don't
take my word for it. Measure.
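
As a crude complement to a real profiler run, a timing loop along these lines could give a first number to compare between libcs. It is only a sketch: `time_usable_size` is a hypothetical helper, it assumes a POSIX system with clock_gettime and a libc that provides malloc_usable_size, and it says nothing about which internal step (get_meta, get_nominal_size) the time is spent in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <malloc.h>   /* malloc_usable_size: nonstandard extension */
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: time `calls` invocations of malloc_usable_size
 * on one block of `size` bytes. Returns elapsed seconds, or a negative
 * value if allocation or the clock fails. */
double time_usable_size(size_t size, long calls)
{
    assert(calls >= 0);

    void *p = malloc(size);
    if (p == NULL)
        return -1.0;

    /* Reading the pointer through a volatile object and accumulating
     * into a volatile sink keeps the compiler from hoisting or
     * dropping the calls. */
    void *volatile vp = p;
    volatile size_t sink = 0;

    struct timespec t0, t1;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) {
        free(p);
        return -1.0;
    }
    for (long i = 0; i < calls; i++)
        sink += malloc_usable_size(vp);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) {
        free(p);
        return -1.0;
    }

    free(p);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it as `time_usable_size(6600, 10000000L)` mirrors the 10 million calls figure mentioned in this thread.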

One thing I can tell you definitively though: if you did what the C
language (which lacks malloc_usable_size) intended you to do, and kept
track of the size of your own buffer, and just used that, you would
spend 0% of the time you're spending on this. You would also save the
entire "several hundred ms per 10 million calls" it's costing on other
malloc implementations, by just *not doing something you don't need to
do*.
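
A minimal sketch of keeping track of the size of your own buffer, using only standard C. The `struct buf` type, the `buf_append` name and the starting capacity of 64 are assumptions made for illustration, not code from musl or from the application under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that remembers its own capacity, so it
 * never has to ask the allocator how big the block is. */
struct buf {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes requested from the allocator */
};

/* Append n bytes; returns 0 on success, -1 on overflow or allocation
 * failure (the buffer is left unchanged in that case). */
int buf_append(struct buf *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));

    if (n > SIZE_MAX - b->len)
        return -1;
    size_t need = b->len + n;

    if (need > b->cap) {
        /* Grow geometrically so repeated appends stay cheap. */
        size_t ncap = b->cap ? b->cap : 64;
        while (ncap < need) {
            if (ncap > SIZE_MAX / 2) {
                ncap = need;
                break;
            }
            ncap *= 2;
        }
        char *q = realloc(b->data, ncap);
        if (q == NULL)
            return -1;
        b->data = q;
        b->cap = ncap;
    }

    if (n != 0)
        memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

The capacity check is a single comparison against a field the caller already owns, which is the cost being contrasted with a call into the allocator.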

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  1:00                   ` Rich Felker
@ 2022-09-20  1:18                     ` baiyang
  2022-09-20  2:15                       ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  1:18 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 2883 bytes --]

> There is no hidden "size actually allocated internally". The size you
> get is the size you requested. Everything else is allocator data
> structures *outside of the object* that the caller has no entitlement
> to peek or poke at, and malloc_usable_size's return value reflects
> that.

If I understand correctly, according to the definition of size_classes in the mallocng code: 
1. When I call `void* p = malloc(6600)`, mallocng actually allocates more than 8100 bytes of usable space, right?
2. According to your previous explanation, calling malloc_usable_size(p) at this time returns 6600, right?

My question is: if malloc_usable_size(p) could directly return 8191 (or a similar actually-allocated size, as other libcs do) instead of 6600, would it be possible for mallocng to achieve higher performance in both time and space?

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  1:18                     ` baiyang
@ 2022-09-20  2:15                       ` Rich Felker
  2022-09-20  2:35                         ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20  2:15 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 09:18:04AM +0800, baiyang wrote:
> > There is no hidden "size actually allocated internally". The size you
> > get is the size you requested. Everything else is allocator data
> > structures *outside of the object* that the caller has no entitlement
> > to peek or poke at, and malloc_usable_size's return value reflects
> > that.
> 
> If I understand correctly, according to the definition of size_classes in the mallocng code: 
> 1. When I call `void* p = malloc(6600)`, mallocng actually allocates
> more than 8100 bytes of usable space, right?

No, it uses space from a size-class-8176 group (~=slab) to produce an
allocation of size 6600. The *allocation* is the part that belongs to
the caller. Everything else is part of the allocator data structures.

> 2. According to your previous explanation, calling
> malloc_usable_size(p) at this time returns 6600, right?

Yes.

> My question is, if malloc_usable_size(p) can directly return 8191
> (or similar actual allocated size, as other libc do) instead of
> 6600, is it possible to make mallocng achieve higher performance
> both in time and space?

No, and the reason you said you want it to does not make sense. You
seem to think that if the group stride was 8100, calling realloc might
memcpy up to 8100 bytes. This is not the case. If realloc has to
allocate a new object, the amount copied will be 6600 or exactly
whatever the allocated object size was (or the new size, if smaller).
This is the only meaningful number.
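
The copy rule described above can be sketched as follows. This is not mallocng's realloc; it is a simplified illustration in which the caller passes the old size explicitly (a real allocator would read it from its own metadata), and the names bytes_to_copy and realloc_by_copy are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Number of bytes that must survive a move: the smaller of the two sizes. */
size_t bytes_to_copy(size_t old_size, size_t new_size)
{
    return old_size < new_size ? old_size : new_size;
}

/* Hypothetical move-style reallocation. new_size must be nonzero.
 * On failure returns NULL and leaves the old block untouched. */
void *realloc_by_copy(void *old, size_t old_size, size_t new_size)
{
    assert(new_size > 0);
    void *fresh = malloc(new_size);
    if (!fresh) return NULL;
    if (old) {
        memcpy(fresh, old, bytes_to_copy(old_size, new_size));
        free(old);
    }
    return fresh;
}
```

For an object of size 6600, bytes_to_copy(6600, 8176) is 6600 and bytes_to_copy(6600, 100) is 100; the slot or group stride never enters the calculation.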

You also seem to be under the impression that the work to determine
that the size was 6600 and not 8100 is where most (or at least a
significant portion of) the time is spent. This is also not the case.
The majority of the metadata processing time is chasing pointers back
to the out-of-band metadata, validating it, validating that it
round-trips back, and validating various other things. Some of these
could in principle be omitted at the cost of loss-of-hardening.

Figuring out that the allocation is 6600 bytes, once you already know
the size class and out-of-band metadata, is quite trivial and hardly
takes any of the time. (It also has a few validation checks that could
be omitted at the cost of loss of hardening, but these are
proportionally much smaller.)

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  2:15                       ` Rich Felker
@ 2022-09-20  2:35                         ` baiyang
  2022-09-20  3:28                           ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  2:35 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4299 bytes --]

> You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case.

Yes, I already understood that mallocng would only memcpy 6600 bytes when I was told that malloc_usable_size will return the size requested by the user.

But AFAIK, many other malloc implementations don't record the requested size of 6600 bytes, so they actually memcpy all 8100 bytes.

> You also seem to be under the impression that the work to determine
> that the size was 6600 and not 8100 is where most (or at least a
> significant portion of) the time is spent.  This is also not the case.
> The majority of the metadata processing time is chasing pointers back
> to the out-of-band metadata, validating it, validating that it
> round-trips back, and validating various other things. Some of these
> could in principle be omitted at the cost of loss-of-hardening.

Yes. My previous understanding (which now seems wrong) was this: other malloc_usable_size implementations, such as tcmalloc, directly return 8100 (the actual length of the allocated size class) and are all very fast, so I could only conclude that mallocng is so much slower because it has to return 6600 rather than 8100. Apart from this difference, I saw no reason for it to be slower than other implementations of malloc_usable_size.

If this is not the main reason, can we speed up this algorithm with the help of a fast lookup table mechanism like tcmalloc's? As I said before, this would greatly increase not only the performance of malloc_usable_size, but also that of realloc and free.

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  2:35                         ` baiyang
@ 2022-09-20  3:28                           ` Rich Felker
  2022-09-20  3:53                             ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20  3:28 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 10:35:02AM +0800, baiyang wrote:
> > You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case.
> 
> Yes, I already understood that mallocng would only memcpy 6600 bytes
> when I was told that malloc_usable_size will return the size
> requested by the user.
> 
> But AFAIK, many other malloc implementations basically don't keep
> 6600 bytes of data. So they're actually going to memcpy the 8100
> bytes.

You have made up a problem that does not exist. Specifically, at least
as far as I can tell, no such implementation exists.

The ones that return some value larger than the requested size are
returning "the requested size, rounded up to a multiple of 16" or
similar. Not "the requested size plus 1500 bytes". For the
dlmalloc-style allocators, this is because they do not have uniform
slots for size classes; they have arbitrary splits, and only care
about aligning start/end on a properly aligned boundary. And for more
slab-like allocators, they all keep track of at least a coarse "actual
size" specifically so that realloc does not gratuitously copy large
amounts of unnecessary data (among other reasons).
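
The "rounded up to a multiple of 16" arithmetic for dlmalloc-style allocators can be worked through with a small helper. The function name round_up is chosen for this example, and the sketch ignores per-chunk header overhead, which is why real allocators may report 6608 or 6624 depending on their layout.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align, which must be a power of two.
 * Overflow for n near SIZE_MAX is not handled in this sketch. */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

With these definitions round_up(6600, 16) is 6608: 6600 + 15 = 6615, and clearing the low four bits leaves 413 * 16 = 6608.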

> > You also seem to be under the impression that the work to determine
> > that the size was 6600 and not 8100 is where most (or at least a
> > significant portion of) the time is spent.  This is also not the case.
> > The majority of the metadata processing time is chasing pointers back
> > to the out-of-band metadata, validating it, validating that it
> > round-trips back, and validating various other things. Some of these
> > could in principle be omitted at the cost of loss-of-hardening.
> 
> Yes, according to my previous understanding (which seems wrong now),
> since other malloc_usable_size implementations that directly return
> 8100 (the actual allocated size class length)

They don't return 8100. They return something like 6608 or 6624.

> such as tcmalloc are
> all very fast, so I can only understand that mallocng is so much
> slower than them because it has to return 6600, not 8100.

This does not follow at all. tcmalloc is fast because it does not have
global consistency, does not have any notable hardening, and (see the
name) keeps large numbers of freed slots *cached* to reuse, thereby
using lots of extra memory. Its malloc_usable_size is not fast because
of returning the wrong value, if it even does return the wrong value
(I have no idea). It's fast because they store the size in-band right
next to the allocated memory and trust that it's valid, rather than
computing it from out-of-band metadata that is not subject to
falsification unless the attacker already has nearly full control of
execution.

> Apart from
> this difference, there is no reason it is slower than other
> implementations of malloc_usable_size as I understand it.

Comparing the speed of malloc_usable_size is utterly pointless. That
is not where the time is spent in any real-world load that's not
gratuitously doing stupid things. I imagine tcmalloc's is faster, for
the reasons explained above (which have nothing to do with whether the
value returned is the exact size you requested). But this has nothing
to do with why the overall performance is higher.

> If this is not the main reason, can we speed up this algorithm with
> the help of a fast lookup table mechanism like tcmalloc? As I said
> before, this not only greatly increases the performance of
> malloc_usable_size , but also the performance of realloc and free .

No, because the fundamental difference is where you're storing the
data, and the *whole point* of mallocng is that we're storing it in a
place where it's not subject to attack via overflows from other
objects, UAF/DF, etc. This makes it somewhat costlier (but still very
reasonable) to obtain.

If you have a particular application where you actually want to trade
safety and low memory usage for performance, you're perfectly free to
link that application to whatever malloc implementation you like. musl
has supported doing that since 1.1.20. It makes no sense to do this
system-wide. You would just increase memory usage and attack surface
drastically with nothing to show for it, since the vast majority of
programs *don't care* how long malloc takes.

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  3:28                           ` Rich Felker
@ 2022-09-20  3:53                             ` baiyang
  2022-09-20  5:41                               ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  3:53 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 6951 bytes --]

> The ones that return some value larger than the requested size are
> returning "the requested size, rounded up to a multiple of 16" or
> similar. Not "the requested size plus 1500 bytes". 
...
> They don't return 8100. They return something like 6608 or 6624.

No, AFAIK, there are many allocators whose malloc_usable_size return value is 1KB (or more) larger than the size requested at malloc time.
For example: if you do "void* p = malloc(6700)" on tcmalloc, then "malloc_usable_size(p)" will return **8192**, far more than just "rounded up to a multiple of 16".

> This does not follow at all. tcmalloc is fast because it does not have
> global consistency, does not have any notable hardening, and (see the
> name) keeps large numbers of freed slots *cached* to reuse, thereby
> using lots of extra memory. Its malloc_usable_size is not fast because
> of returning the wrong value, if it even does return the wrong value
> (I have no idea). 

We don't need to refer to these features of tcmalloc, we only need to refer to its malloc_usable_size algorithm.

> It's fast because they store the size in-band right
> next to the allocated memory and trust that it's valid, rather than
> computing it from out-of-band metadata that is not subject to
> falsification unless the attacker already has nearly full control of
> execution.

No, if I understand correctly, tcmalloc doesn't store the size in-band right next to the allocated memory. On the contrary, when executing malloc_usable_size(p) (actually GetSize(p)), it first finds the size class corresponding to p through a quick lookup table, and then returns the length of that size class. See: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

My understanding: the biggest impediment to applying similar optimizations is that we have to return 6700, not 8192 (of course, you've denied this is the reason).

On the other hand, if the low speed is not caused by having to return 6700, then we should be able to use a similar quick lookup table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve a performance improvement of at least several dozen times.

Thanks :-P

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  3:53                             ` baiyang
@ 2022-09-20  5:41                               ` Rich Felker
  2022-09-20  5:56                                 ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20  5:41 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote:
> > The ones that return some value larger than the requested size are
> > returning "the requested size, rounded up to a multiple of 16" or
> > similar. Not "the requested size plus 1500 bytes". 
> ...
> > They don't return 8100. They return something like 6608 or 6624.
> 
> No, AFAIK, there are many allocators whose return value of
> malloc_usable_size is 1KB (or more) larger than the requested value
> at malloc time.
> For example: if you do "void* p = malloc(6700)" on tcmalloc, then
> "malloc_usable_size(p)" will return **8192**. Far more than just
> "rounded up to a multiple of 16".

OK, thanks for checking and correcting.

> > This does not follow at all. tcmalloc is fast because it does not have
> > global consistency, does not have any notable hardening, and (see the
> > name) keeps large numbers of freed slots *cached* to reuse, thereby
> > using lots of extra memory. Its malloc_usable_size is not fast because
> > of returning the wrong value, if it even does return the wrong value
> > (I have no idea). 
> 
> We don't need to refer to these features of tcmalloc, we only need
> to refer to its malloc_usable_size algorithm.

Those (mis)features are what provide a fast path here, regardless of
whether you care about them.

> > It's fast because they store the size in-band right
> > next to the allocated memory and trust that it's valid, rather than
> > computing it from out-of-band metadata that is not subject to
> > falsification unless the attacker already has nearly full control of
> > execution.
> 
> No, if I understand correctly, tcmalloc doesn't store the size
> in-band right next to the allocated memory. On the contrary, when
> executing malloc_usable_size(p) (actually GetSize(p)), it will first
> find the size class corresponding to p through a quick lookup table,
> and then return the length of the size class. See:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

OK, I was confusing tcmalloc with the more conventional "thread-local
freelist caching on top of dlmalloc type base" allocator strategy.
tcmalloc is, however, indeed one of the gigantic ones.

> My understanding: the biggest impediment to applying similar
> optimizations is that we have to return 6700, not 8192 (of
> course, you've denied this is the reason).

Your understanding is wrong. I've told you how you can measure that
it's wrong. You insist on being stuck on it for no good reason.

If you want to understand *why* tcmalloc is different, start with the
comments at the top of the file you linked:

> //  4. The pagemap (which maps from page-number to descriptor),
> //     can be read without holding any locks, and written while holding
> //     the "pageheap_lock".
> //
> //     This multi-threaded access to the pagemap is safe for fairly
> //     subtle reasons.  We basically assume that when an object X is
> //     allocated by thread A and deallocated by thread B, there must
> //     have been appropriate synchronization in the handoff of object
> //     X from thread A to thread B.

This is the kind of thing I mean by lack of global consistency (no
synchronization around access to these data structures) and lack of
any meaningful hardening (*assuming* no memory lifetime usage errors
in the calling application).

The GetSize function you cited uses this global pagemap to go straight
from a page address to a sizeclass, via what amounts to a two-level or
three-level table indexed by upper bits of the address (comment says
3-level is only used in slower but lower-mem-use configuration). These
tables, at least in the 2-level form, are utterly *massive*. I'm not
sure if it creates them PROT_NONE and then only instantiates real
memory for the (initially fairly sparse) parts that get used, or if it
just allocates these giant things relying on overcommit, but either
way that's not compatible with small-memory-space systems or with
nommu.
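
To make the "table indexed by upper bits of the address" idea concrete, here is a toy two-level page map. It is a sketch under stated assumptions, not tcmalloc's code: the bit widths, the 8 KiB page size, and the names pagemap_set/pagemap_get are invented for illustration, and the toy covers only 8 GiB of address space. Scaling the same layout to a full 64-bit user address space is what makes the real tables so large.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy parameters (assumptions for illustration, not tcmalloc's values). */
#define PAGE_SHIFT 13                  /* 8 KiB pages */
#define LEAF_BITS  10                  /* 1024 entries per leaf */
#define ROOT_BITS  10                  /* 1024 leaves in the root */
#define LEAF_LEN   (1u << LEAF_BITS)
#define ROOT_LEN   (1u << ROOT_BITS)

struct leaf { uint8_t sizeclass[LEAF_LEN]; };
static struct leaf *root[ROOT_LEN];    /* leaves are created on demand */

/* Record a nonzero size class for the page containing addr.
 * Returns 0 on success, -1 if addr is out of range or memory runs out. */
int pagemap_set(uintptr_t addr, uint8_t sizeclass)
{
    assert(sizeclass != 0);            /* 0 is reserved for "no entry" */
    uintptr_t page = addr >> PAGE_SHIFT;
    uintptr_t hi = page >> LEAF_BITS;
    if (hi >= ROOT_LEN) return -1;
    if (!root[hi]) {
        root[hi] = calloc(1, sizeof *root[hi]);
        if (!root[hi]) return -1;
    }
    root[hi]->sizeclass[page & (LEAF_LEN - 1)] = sizeclass;
    return 0;
}

/* Look up the size class for addr with two array indexings and no locks.
 * Returns 0 when the page has no entry. */
uint8_t pagemap_get(uintptr_t addr)
{
    uintptr_t page = addr >> PAGE_SHIFT;
    uintptr_t hi = page >> LEAF_BITS;
    if (hi >= ROOT_LEN || !root[hi]) return 0;
    return root[hi]->sizeclass[page & (LEAF_LEN - 1)];
}
```

The lookup is fast precisely because it trusts the table and the address: nothing here checks that the pointer was ever returned by the allocator or is still live.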

On top of that, this approach relies on laying out whole pages (likely
large slabs of many pages at a time) of identical-sized objects so
that the size and other properties can be looked up by page number. I
have not looked into the details of "how bad" it gets, but it
completely precludes having any small processes, and precludes
promptly returning freed memory to the system, since *changing* the
pagemap is going to be costly and they're going to avoid doing it
(note the above comment on locking).

mallocng does not have any global mapping optimizing translation from
addresses to groups/metadata objects. Because we insist on global
consistency (a prerequisite for being able to define strong hardening
properties) and on being able to return freed memory promptly to the
system, maintaining such a data structure would cost a lot more time
(performance) than anything it could give, and it would make lock-free
operations (like your malloc_usable_size, or trivial realloc calls)
potentially require locking.

Instead of using the numeric value of the address to map to metadata,
we chase offsets from the object base address to the metadata, then
validate that it round-trips back to conclude that we didn't just
follow random junk from the caller passing an invalid/dangling pointer
in, or from this data being overwritten via heap-based buffer
overflows.
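
The offset chasing and round-trip check described above can be illustrated with a toy layout. This is a heavily simplified sketch and not mallocng's actual data structures: the toy_* names, the 16-byte header, and the 2-byte in-band offset are assumptions made for the example, and where this sketch returns NULL a hardened allocator would typically crash instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UNIT 16   /* header size and offset granularity in this toy layout */

struct toy_group;

struct toy_meta {               /* out-of-band record */
    struct toy_group *mem;      /* back-pointer used for the round-trip check */
    uint32_t stride;            /* slot size in bytes, a multiple of UNIT */
    uint32_t nslots;
};

struct toy_group {              /* in-band header, exactly UNIT bytes; slots follow */
    struct toy_meta *meta;
    unsigned char pad[UNIT - sizeof(struct toy_meta *)];
};

/* Allocate a zeroed group of nslots slots and link it to *m. NULL on failure. */
struct toy_group *toy_group_new(struct toy_meta *m, uint32_t stride, uint32_t nslots)
{
    assert(stride > 0 && stride % UNIT == 0);
    size_t len = UNIT + (size_t)stride * nslots;
    struct toy_group *g = aligned_alloc(UNIT, len);
    if (!g) return NULL;
    memset(g, 0, len);
    g->meta = m;
    m->mem = g;
    m->stride = stride;
    m->nslots = nslots;
    return g;
}

/* Store the in-band offset just before slot idx and return its user pointer. */
unsigned char *toy_slot_init(struct toy_group *g, uint32_t idx)
{
    unsigned char *p = (unsigned char *)g + UNIT + (size_t)idx * g->meta->stride;
    uint16_t off = (uint16_t)(((size_t)idx * g->meta->stride) / UNIT);
    memcpy(p - 2, &off, sizeof off);
    return p;
}

/* Chase p back to its metadata; NULL if any validation step fails. */
struct toy_meta *toy_get_meta(unsigned char *p)
{
    if ((uintptr_t)p % UNIT) return NULL;              /* misaligned pointer */
    uint16_t off;
    memcpy(&off, p - 2, sizeof off);
    struct toy_group *base = (struct toy_group *)(p - (size_t)off * UNIT - UNIT);
    struct toy_meta *m = base->meta;
    if (!m || m->mem != base) return NULL;             /* round trip failed */
    if (((size_t)off * UNIT) % m->stride) return NULL; /* not a slot start */
    if (((size_t)off * UNIT) / m->stride >= m->nslots) return NULL;
    return m;
}
```

Each check costs a load or a division, which matches the observation that much of the lookup time goes to validation rather than to the pointer chasing itself.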

Fundamentally, this pointer chasing is going to be a little bit more
expensive than just using address bits as table indices, but not
really all that much. At least half of the cost difference, and
probably a lot more, is not the pointer/offset chasing but the
validation (hardening). If hypothetically you wanted to turn that all
off (e.g. by defining the assert macro to a no-op) you could have it
be a lot faster, and still have low memory usage too. I'm not sure but
for single-threaded loads I would not be surprised if it were getting
close to tcmalloc speed. Just casually building with assert() defined
to nop out the tests, I got double performance on your TEST2. Of
course, I don't recommend doing this. But it's an interesting test for
performing *measurement* (which you so far refuse to do) of what's
actually making the performance differences.

> On the other hand, if the low speed is not caused by having to
> return 6700, then we should be able to use a similar quick lookup
> table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve
> at least dozens of times performance improvement.

Once again, the big difference is not the "6700". The
tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines
6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of
the work here is in lines 6-11, mainly lines 7 and 10. I did a similar
casual test removing line 12 and just returning something based on the
earlier computations, and it made something like a 30% reduction in
test run time (with or without the hardening asserts nopped out).

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  5:41                               ` Rich Felker
@ 2022-09-20  5:56                                 ` baiyang
  2022-09-20 12:16                                   ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20  5:56 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8157 bytes --]

> //     This multi-threaded access to the pagemap is safe for fairly
> //     subtle reasons.  We basically assume that when an object X is
> //     allocated by thread A and deallocated by thread B, there must
> //     have been appropriate synchronization in the handoff of object
> //     X from thread A to thread B.

Thanks for your information.
I feel this assumption is very reasonable: you can't have one thread doing "free(p)" while another thread is accessing the block pointed to by p without any synchronization mechanism at the same time. 

> but either way that's not compatible with small-memory-space systems or with nommu.

OK, that's reasonable. And again, thanks for your patience and time :-D

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 

From: Rich Felker
Date: 2022-09-20 13:41
To: baiyang
CC: musl
Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote:
> > The ones that return some value larger than the requested size are
> > returning "the requested size, rounded up to a multiple of 16" or
> > similar. Not "the requested size plus 1500 bytes". 
> ...
> > They don't return 8100. They return something like 6608 or 6624.
> 
> No, AFAIK, there are many allocators whose return value of
> malloc_usable_size is 1KB (or more) larger than the requested value
> at malloc time.
> For Example: if you do "void* p = malloc(6700)" on tcmalloc, then
> "malloc_usable_size(p)" will return **8192**. Far more than just
> "rounded up to a multiple of 16".

OK, thanks for checking and correcting.

> > This does not follow at all. tcmalloc is fast because it does not have
> > global consistency, does not have any notable hardening, and (see the
> > name) keeps large numbers of freed slots *cached* to reuse, thereby
> > using lots of extra memory. Its malloc_usable_size is not fast because
> > of returning the wrong value, if it even does return the wrong value
> > (I have no idea). 
> 
> We don't need to refer to these features of tcmalloc, we only need
> to refer to its malloc_usable_size algorithm.

Those (mis)features are what provide a fast path here, regardless of
whether you care about them.

> > It's fast because they store the size in-band right
> > next to the allocated memory and trust that it's valid, rather than
> > computing it from out-of-band metadata that is not subject to
> > falsification unless the attacker already has nearly full control of
> > execution.
> 
> No, if I understand correctly, tcmalloc doesn't store the size
> in-band right next to the allocated memory. On the contrary, when
> executing malloc_usable_size(p) (actually GetSize(p)), it will first
> find the size class corresponding to p through a quick lookup table,
> and then return the length of the size class. See:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

OK, I was confusing tcmalloc with the more conventional "thread-local
freelist caching on top of dlmalloc type base" allocator strategy.
tcmalloc, however, is indeed one of the gigantic ones.

> My understanding: the biggest impediment to our inability to apply
> similar optimizations is that we have to return 6700, not 8192 (of
> course, you've denied this is the reason).

Your understanding is wrong. I've told you how you can measure that
it's wrong. You insist on being stuck on it for no good reason.

If you want to understand *why* tcmalloc is different, start with the
comments at the top of the file you linked:

> //  4. The pagemap (which maps from page-number to descriptor),
> //     can be read without holding any locks, and written while holding
> //     the "pageheap_lock".
> //
> //     This multi-threaded access to the pagemap is safe for fairly
> //     subtle reasons.  We basically assume that when an object X is
> //     allocated by thread A and deallocated by thread B, there must
> //     have been appropriate synchronization in the handoff of object
> //     X from thread A to thread B.

This is the kind of thing I mean by lack of global consistency (no
synchronization around access to these data structures) and lack of
any meaningful hardening (*assuming* no memory lifetime usage errors
in the calling application).

The GetSize function you cited uses this global pagemap to go straight
from a page address to a sizeclass, via what amounts to a two-level or
three-level table indexed by upper bits of the address (comment says
3-level is only used in slower but lower-mem-use configuration). These
tables, at least in the 2-level form, are utterly *massive*. I'm not
sure if it creates them PROT_NONE and then only instantiates real
memory for the (initially fairly sparse) parts that get used, or if it
just allocates these giant things relying on overcommit, but either
way that's not compatible with small-memory-space systems or with
nommu.
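
As a rough illustration of the kind of lookup described above, here is a
toy two-level page map in C. It is a sketch only, not tcmalloc's code:
the page size, the table widths and the names pagemap_set/pagemap_get
are all assumptions made up for the example, and it covers only 33 bits
of address space.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy two-level page map: page number -> size class.
 * Every constant and name here is an illustrative assumption. */
#define PAGE_SHIFT 13                 /* 8 KiB pages (assumed) */
#define LEAF_BITS  10
#define ROOT_BITS  10
#define LEAF_SIZE  (1u << LEAF_BITS)
#define ROOT_SIZE  (1u << ROOT_BITS)

struct leaf { uint8_t sizeclass[LEAF_SIZE]; };
static struct leaf *root[ROOT_SIZE];  /* leaves are created on demand */

/* Record the size class of the page containing addr.
 * A real allocator would hold a lock while writing. */
static int pagemap_set(uintptr_t addr, uint8_t sc)
{
	uintptr_t page = addr >> PAGE_SHIFT;
	size_t hi = (page >> LEAF_BITS) & (ROOT_SIZE - 1);
	size_t lo = page & (LEAF_SIZE - 1);
	if (!root[hi]) {
		root[hi] = calloc(1, sizeof *root[hi]);
		if (!root[hi]) return -1;
	}
	root[hi]->sizeclass[lo] = sc;
	return 0;
}

/* Unlocked lookup: two indexed loads, and no check that addr
 * was ever handed out by the allocator. 0 means "unknown". */
static uint8_t pagemap_get(uintptr_t addr)
{
	uintptr_t page = addr >> PAGE_SHIFT;
	size_t hi = (page >> LEAF_BITS) & (ROOT_SIZE - 1);
	size_t lo = page & (LEAF_SIZE - 1);
	return root[hi] ? root[hi]->sizeclass[lo] : 0;
}
```

The lookup is cheap because it trusts the address bits completely; with
a full 48-bit address space the tables grow accordingly, which is the
memory cost discussed above.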

On top of that, this approach relies on laying out whole pages (likely
large slabs of many pages at a time) of identical-sized objects so
that the size and other properties can be looked up by page number. I
have not looked into the details of "how bad" it gets, but it
completely precludes having any small processes, and precludes
promptly returning freed memory to the system, since *changing* the
pagemap is going to be costly and they're going to avoid doing it
(note the above comment on locking).

mallocng does not have any global mapping optimizing translation from
addresses to groups/metadata objects. Because we insist on global
consistency (a prerequisite for being able to define strong hardening
properties) and on being able to return freed memory promptly to the
system, maintaining such a data structure would cost a lot more time
(performance) than anything it could give, and it would make lock-free
operations (like your malloc_usable_size, or trivial realloc calls)
potentially require locking.

Instead of using the numeric value of the address to map to metadata,
we chase offsets from the object base address to the metadata, then
validate that it round-trips back to conclude that we didn't just
follow random junk from the caller passing an invalid/dangling pointer
in, or from this data being overwritten via heap-based buffer
overflows.
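
The offset-chasing-plus-round-trip idea can be sketched as follows. This
is a simplified model, not mallocng's actual data layout: the structs,
the header byte positions and the names group_init, slot_ptr and
get_meta_checked are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UNIT 16

struct meta;

/* In-band group header; the slots follow it directly. */
struct group {
	struct meta *meta;                               /* link to out-of-band metadata */
	unsigned char pad[UNIT - sizeof(struct meta *)];
};
_Static_assert(sizeof(struct group) == UNIT, "header must be one UNIT");

/* Out-of-band metadata; a hardened design keeps this away from heap data. */
struct meta {
	struct group *mem;  /* back-pointer used for the round-trip check */
	size_t stride;      /* bytes between slot starts, a multiple of UNIT */
	unsigned nslots;    /* nslots * stride / UNIT must fit in 16 bits */
};

static unsigned char *slot_ptr(const struct meta *m, unsigned i)
{
	return (unsigned char *)m->mem + UNIT + i * m->stride;
}

/* Allocate a group and stamp each slot's in-band header:
 * p[-3] = slot index, p[-2..-1] = offset from first slot in UNITs. */
static int group_init(struct meta *m, size_t stride, unsigned nslots)
{
	struct group *g = calloc(1, sizeof *g + stride * nslots);
	if (!g) return -1;
	g->meta = m;
	m->mem = g;
	m->stride = stride;
	m->nslots = nslots;
	for (unsigned i = 0; i < nslots; i++) {
		unsigned char *p = slot_ptr(m, i);
		uint16_t off = (uint16_t)(i * stride / UNIT);
		p[-3] = (unsigned char)i;
		memcpy(p - 2, &off, sizeof off);
	}
	return 0;
}

/* Chase the in-band offset to the group, follow it to the metadata,
 * then check that the metadata points back and agrees with the header. */
static struct meta *get_meta_checked(const unsigned char *p)
{
	uint16_t off;
	unsigned idx = p[-3];
	memcpy(&off, p - 2, sizeof off);
	const struct group *g = (const void *)(p - (size_t)off * UNIT - UNIT);
	struct meta *m = g->meta;
	if (!m || m->mem != g) return NULL;                      /* round trip failed */
	if (idx >= m->nslots) return NULL;                       /* index out of range */
	if ((size_t)off * UNIT != idx * m->stride) return NULL;  /* header inconsistent */
	return m;
}
```

The three checks at the end are the part that costs time; a real
hardened allocator would abort instead of returning NULL.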

Fundamentally, this pointer chasing is going to be a little bit more
expensive than just using address bits as table indices, but not
really all that much. At least half of the cost difference, and
probably a lot more, is not the pointer/offset chasing but the
validation (hardening). If hypothetically you wanted to turn that all
off (e.g. by defining the assert macro to a no-op) you could have it
be a lot faster, and still have low memory usage too. I'm not sure but
for single-threaded loads I would not be surprised if it were getting
close to tcmalloc speed. Just casually building with assert() defined
to nop out the tests, I got double performance on your TEST2. Of
course, I don't recommend doing this. But it's an interesting test for
performing *measurement* (which you so far refuse to do) of what's
actually making the performance differences.

> On the other hand, if the low speed is not caused by having to
> return 6700, then we should be able to use a similar quick lookup
> table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve
> at least dozens of times performance improvement.

Once again, the big difference is not the "6700". The
tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines
6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of
the work here is in lines 6-11, mainly lines 7 and 10. I did a similar
casual test removing line 12 and just returning something based on the
earlier computations, and it made something like a 30% reduction in
test run time (with or without the hardening asserts nopped out).

Rich


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 13:46     ` Rich Felker
  2022-09-19 13:53       ` James Y Knight
@ 2022-09-20  8:33       ` Florian Weimer
  2022-09-20 13:54         ` Siddhesh Poyarekar
  1 sibling, 1 reply; 68+ messages in thread
From: Florian Weimer @ 2022-09-20  8:33 UTC (permalink / raw)
  To: Rich Felker; +Cc: baiyang, musl, Siddhesh Poyarekar

* Rich Felker:

> On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>> > unlike musl those implementations don't return exact size nor have the
>> > same security and memory fragmentation guarantees, so bad comparison.
>> >
>> > tcmalloc:
>> >   // Returns the actual number N of bytes reserved by tcmalloc for the pointer
>> >   // p.  This number may be equal to or greater than the number of bytes
>> >   // requested when p was allocated.
>> >   //
>> >   // This function is just useful for statistics collection.  The client must
>> >   // *not* read or write from the extra bytes that are indicated by this call.
>> >
>> > jemalloc:
>> >       <para>The <function>malloc_usable_size()</function> function
>> >       returns the usable size of the allocation pointed to by
>> >       <parameter>ptr</parameter>.  The return value may be larger than the size
>> >       that was requested during allocation.  The
>> >       <function>malloc_usable_size()</function> function is not a
>> >       mechanism for in-place <function>realloc()</function>; rather
>> >       it is provided solely as a tool for introspection purposes.  Any
>> >       discrepancy between the requested allocation size and the size reported
>> >       by <function>malloc_usable_size()</function> should not be
>> >       depended on, since such behavior is entirely implementation-dependent.
>> 
>> These implementations are buggy or at least mis-documented.  The
>> interface contract is clearly that for that particular object, the extra
>> bytes in the allocation are available for reading and writing.  It is
>> not guaranteed that the allocator will always provide the same number of
>> extra bytes for the same requested size, but they must be there for the
>> allocation being examined.  It's even in the name of the function!
>
> I'm not sure I understand what you're saying, but the core problem
> that really can't be solved is potential discrepancy between the
> malloc implementation's idea of usable and the compiler's. For
> example:
>
> 	char *p = malloc(1);
> 	if (malloc_usable_size(p)>1) p[1] = 42;
>
> will cause a compiler that's actively detecting UB to abort the
> program when malloc_usable_size returns a value larger than 1.

The compiler needs to treat malloc_usable_size similar to realloc and
adjust the size information for the buffer based on the return value from
malloc_usable_size.  This is admittedly harder to do than a comparable
analysis for realloc if the compiler interprets the standard in such a
way that after a successful realloc, any access to the original pointer
value is undefined.

malloc_usable_size is not actually *that* useful with allocators that do
not have strict size classes because they do not over-allocate that
much.  For these allocators, it may be possible to increase the size of
allocation significantly without moving it, but that is not reflected in
the return value of malloc_usable_size at all.

Thanks,
Florian



* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-19 20:46             ` Nat!
@ 2022-09-20  8:51               ` Szabolcs Nagy
  0 siblings, 0 replies; 68+ messages in thread
From: Szabolcs Nagy @ 2022-09-20  8:51 UTC (permalink / raw)
  To: Nat!; +Cc: musl

* Nat! <nat@mulle-kybernetik.com> [2022-09-19 22:46:20 +0200]:
> 
> On 19.09.22 20:14, Szabolcs Nagy wrote:
> > * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]:
> > > I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ).
> > > 
> > > On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499
> > > 
> > > I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segmentation fault.
> > obviously internally the implementation can use the internal chunk size...
> > 
> > GetSize(p) is not the exact size (that the user allocated) but an internal
> > size (which may be larger) and that must not be exposed *outside* of the
> > malloc implementation (other than for diagnostic purposes).
> > 
> > you can have 2 views:
> > 
> > (1) tcmalloc and jemalloc are buggy because they expose an internal
> >      that must not be exposed (because it can break user code).
> > 
> > (2) user code is buggy if it uses malloc_usable_size for any purpose
> >      other than diagnostic/statistics (because other uses are broken
> >      on many implementations).
> > 
> > either way the brokenness you want to support is a security hazard
> > and you are lucky that musl saves the day: it works hard not to
> > expose internal sizes so the code you seem to care about can operate
> > safely (which is not true on tcmalloc and jemalloc: the compiler
> > may break that code).
> > 
> You can also have the third view, that malloc is allocating "at least" the
> amount of size requested (as it technically is likely to do). That you
> can use "malloc_usable_size" to get the actually available size. That the
> code that is enforcing the semantics, that only the "at least" bytes should
> be accessed is in error, unless the error checking code modifies
> "malloc_usable_size" to only return the size as requested by the user.
> 
> Surely not a popular opinion :D :D

that third option is

(3) the compiler must not optimize based on the malloc interface
    contract. (just treat it like a normal extern call.)

which currently only happens at -O0 or -fno-builtin-malloc.
(although it seems those dont affect __builtin_dynamic_object_size)

note: the compiler cannot know about all non-standard extensions
and even if it did it cannot prove the absence of their call if
a pointer escapes. this means lot of malloc related optimizations
and diagnostics have to be turned off for everyone just in case
there is a m_u_s call. while this is a valid option it's not on
the libc to implement but on the compiler, which we cant control.
even if compiler devs agree with (3) we still cannot expose users
to security risk who use older compilers for some time. i think
it's not likely gcc or clang would accept (3) by default.


* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  5:56                                 ` baiyang
@ 2022-09-20 12:16                                   ` Rich Felker
  2022-09-20 17:21                                     ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-20 12:16 UTC (permalink / raw)
  To: baiyang; +Cc: musl

On Tue, Sep 20, 2022 at 01:56:12PM +0800, baiyang wrote:
> > //     This multi-threaded access to the pagemap is safe for fairly
> > //     subtle reasons.  We basically assume that when an object X is
> > //     allocated by thread A and deallocated by thread B, there must
> > //     have been appropriate synchronization in the handoff of object
> > //     X from thread A to thread B.
> 
> Thanks for your information.
> I feel this assumption is very reasonable: you can't have one thread
> doing "free(p)" while another thread is accessing the block pointed
> to by p without any synchronization mechanism at the same time.

That's a correct assumption given that the program's behavior is
defined. The problem is that once you assume that, you've completely
lost control over what happens when the program's behavior is
undefined due to memory lifetime bugs in the application (UAF/DF/etc)
and an allocator that does this will necessarily allow bugs in the
lifetimes of objects that aren't inherently exploitable on the
application level (don't contain pointers that will be written through
that might clobber other application state, etc.) to be used to gain
control over the allocator state and eventually gain control over the
flow of execution through manipulating other allocated objects that
are attack vectors.

A hardened allocator like mallocng or hardened_malloc does not make
assumptions about the validity of data reachable for clobbering
through application bugs that haven't already yielded a high level of
control to the attacker. Instead, the metadata assumed to be valid is
at "secret locations" outside the area where application data is
stored that are intended not to be determinable without already having
some strong level of control over execution flow. And on top of that,
it's cross-validated as much as possible.

We also do validation where we can of the "in-band" data that is
easily reachable by overflows in application buffers, etc. This both
blocks a lot of potential exploits, and catches application bugs that
could be exploitable before they turn into exploits, by having them
crash on developers'/packagers' systems before they're deployed and
getting the root cause investigated and fixed.

Rich


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20  8:33       ` Florian Weimer
@ 2022-09-20 13:54         ` Siddhesh Poyarekar
  2022-09-20 16:59           ` James Y Knight
                             ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Siddhesh Poyarekar @ 2022-09-20 13:54 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Rich Felker, baiyang, musl

On Tue, Sep 20, 2022 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
> The compiler needs to treat malloc_usable_size similar to realloc and
> adjust the size information for the buffer based on the return value from
> malloc_usable_size.  This is admittedly harder to do than a comparable
> analysis for realloc if the compiler interprets the standard in such a
> way that after a successful realloc, any access to the original pointer
> value is undefined.
>
> malloc_usable_size is not actually *that* useful with allocators that do
> not have strict size classes because they do not over-allocate that
> much.  For these allocators, it may be possible to increase the size of
> allocation significantly without moving it, but that is not reflected in
> the return value of malloc_usable_size at all.

So the glibc manual does not document malloc_usable_size semantics
(which is weird since it is, well, a GNU extension!) so the only
reference users have is the man page.  The man page already
discourages use of malloc_usable_size to write beyond the allocated
size in the NOTES section:

    The value returned by malloc_usable_size() may be greater than
    the requested size of the allocation because of alignment and
    minimum size constraints.  Although the excess bytes can be
    overwritten by the application without ill effects, this is not
    good programming practice: the number of excess bytes in an
    allocation depends on the underlying implementation.

Adding support for something that's already declared as bad
programming practice seems like a step backwards.  Instead, I hope we
find a way to discourage active use of malloc_usable_size more
strongly.  At least based on the systemd experience, the problem they
try to solve is that of glibc realloc being too slow for paths where
the reallocation should return the same block and that should be easy
to special-case.  Is there any other valid reason to use
malloc_usable_size instead of simply using realloc?

Sid


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 13:54         ` Siddhesh Poyarekar
@ 2022-09-20 16:59           ` James Y Knight
  2022-09-20 17:34             ` Szabolcs Nagy
  2022-09-20 17:39             ` baiyang
  2022-09-20 17:28           ` baiyang
  2022-10-10 14:13           ` Florian Weimer
  2 siblings, 2 replies; 68+ messages in thread
From: James Y Knight @ 2022-09-20 16:59 UTC (permalink / raw)
  To: musl; +Cc: Florian Weimer, Rich Felker, baiyang


On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com>
wrote:

> Adding support for something that's already declared as bad
> programming practice seems like a step backwards.  Instead, I hope we
> find a way to discourage active use of malloc_usable_size more
> strongly.

BTW, if folks aren't aware, there is already work on the C++ side to expose
an API which lets you request a heap allocation of _at least_ the given
size, which rounds the actual size up in whatever way the allocator likes,
and returns the pointer and actual size allocated. With this API, you
declare an explicit intent that all of the memory -- up to the returned
size -- is valid to use without needing to go back to the allocator to ask
for more.

The proposal is still making its way through the standardization process,
but hopefully it'll make it into the next version of C++ after C++23.  (Of
course, that's not a sure thing until it happens.) Here's the doc, with
more rationale/etc:
  https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html

Also, as noted in the doc, jemalloc experimentally implemented this
functionality in its non-standard API, via a function it called "smallocx"
-- though jemalloc hides the API so it can't be used by default. The API is
effectively:
  typedef struct { void *ptr; size_t size; } smallocx_return_t;
  smallocx_return_t smallocx(size_t size, int flags);
https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414
(That's consistent with jemalloc's other non-standard APIs, which stick
alignment/etc into a "flags" argument, but probably not suitable for a
more-standardized cross-implementation API)
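
To make the contract concrete, here is a minimal portable sketch of a
size-returning allocation with the same shape as the struct above. The
names sized_alloc_t and alloc_at_least are hypothetical, and this
fallback is deliberately conservative: it reports exactly the requested
size, which is always correct but gives none of the benefit a real
allocator-integrated version would.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result type, modeled on the smallocx_return_t shape. */
typedef struct { void *ptr; size_t size; } sized_alloc_t;

/* Request at least `size` bytes. On success, every byte in
 * [ptr, ptr + size) of the result may be used by the caller.
 * This fallback never reports more than was requested. */
static sized_alloc_t alloc_at_least(size_t size)
{
	sized_alloc_t r = { malloc(size), 0 };
	if (r.ptr) r.size = size;
	return r;
}
```

An allocator with size classes could instead fill in the class size,
which is the point of making the intent explicit at allocation time.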

tcmalloc implements similar functionality, as well, with a family of
functions named "tcmalloc_size_returning_operator_new":

https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549
which of course also isn't a suitable API to support cross-implementation.

If someone wants to push forward this area, IMO, it would be really great
to have an API exposing this functionality designed to be implemented in a
common way across libc malloc implementations -- and eventually added to
POSIX or C.


* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 12:16                                   ` Rich Felker
@ 2022-09-20 17:21                                     ` baiyang
  0 siblings, 0 replies; 68+ messages in thread
From: baiyang @ 2022-09-20 17:21 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


> The problem is that once you assume that, you've completely
> lost control over what happens when the program's behavior is
> undefined due to memory lifetime bugs in the application (UAF/DF/etc) 

Yes, but that's exactly the contract between the C language and the programmer: to leave everything in the hands of people for maximum control and efficiency.

If we need more "protection", then we can just use other more advanced (and inefficient) languages.

Also, C gives the programmer enormous flexibility (which is also hugely destructive when used incorrectly), so I'm afraid that the various checks we do at such a high cost may be no more than "better than nothing" at preventing bugs.

Of course, these tradeoffs are indeed a delicate matter. It's hard to have a definitive answer.

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 



* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 13:54         ` Siddhesh Poyarekar
  2022-09-20 16:59           ` James Y Knight
@ 2022-09-20 17:28           ` baiyang
  2022-09-20 17:44             ` Siddhesh Poyarekar
  2022-10-10 14:13           ` Florian Weimer
  2 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20 17:28 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Florian Weimer; +Cc: Rich Felker, musl


> Is there any other valid reason to use
> malloc_usable_size instead of simply using realloc? 

Yes, we use it as one of the parameters to estimate the memory copy cost when realloc degenerates to malloc+memcpy+free; see the previous mails.

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 
 


* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 16:59           ` James Y Knight
@ 2022-09-20 17:34             ` Szabolcs Nagy
  2022-09-20 19:53               ` James Y Knight
  2022-09-24  8:55               ` Fangrui Song
  2022-09-20 17:39             ` baiyang
  1 sibling, 2 replies; 68+ messages in thread
From: Szabolcs Nagy @ 2022-09-20 17:34 UTC (permalink / raw)
  To: James Y Knight; +Cc: musl, Florian Weimer, Rich Felker, baiyang

* James Y Knight <jyknight@google.com> [2022-09-20 12:59:00 -0400]:

> On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com>
> wrote:
> 
> > Adding support for something that's already declared as bad
> > programming practice seems like a step backwards.  Instead, I hope we
> > find a way to discourage active use of malloc_usable_size more
> > strongly.
> 
> 
> BTW, if folks aren't aware, there is already work on the C++ side to expose
> an API which lets you request a heap allocation of _at least_ the given
> size, which rounds the actual size up in whatever way the allocator likes,
> and returns the pointer and actual size allocated. With this API, you
> declare an explicit intent that all of the memory -- up to the returned
> size -- is valid to use without needing to go back to the allocator to ask
> for more.
> 
> The proposal is still making its way through the standardization process,
> but hopefully it'll make it into the next version of C++ after C++23.  (Of
> course, that's not a sure thing until it happens.) Here's the doc, with
> more rationale/etc:
>   https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html

this does not seem to discuss how existing applications
that override new() would cope with this.

nor how existing implementations on top of c allocators
would implement it (given that we just agreed that
malloc_usable_size is not suitable for such use).

nor how existing allocator tooling (interposers, profilers)
would handle the new interface.

> 
> Also, as noted in the doc, jemalloc experimentally implemented this
> functionality in its non-standard API, via a function it called "smallocx"
> -- though jemalloc hides the API so it can't be used by default. The API is
> effectively:
>   typedef struct { void *ptr; size_t size; } smallocx_return_t;
>   smallocx_return_t smallocx(size_t size, int flags);
> https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414
> (That's consistent with jemalloc's other non-standard APIs, which stick
> alignment/etc into a "flags" argument, but probably not suitable for a
> more-standardized cross-implementation API)
> 
> tcmalloc implements similar functionality, as well, with a family of
> functions named "tcmalloc_size_returning_operator_new":

so there are already incompatible c apis, which means this
should not be considered a viable proposal at this point.

> 
> https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549
> which of course also isn't a suitable API to support cross-implementation.
> 
> If someone wants to push forward this area, IMO, it would be really great
> to have an API exposing this functionality designed to be implemented in a
> common way across libc malloc implementations -- and eventually added to
> POSIX or C.

this is done the wrong way around.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 16:59           ` James Y Knight
  2022-09-20 17:34             ` Szabolcs Nagy
@ 2022-09-20 17:39             ` baiyang
  2022-09-20 18:12               ` Quentin Rameau
  1 sibling, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20 17:39 UTC (permalink / raw)
  To: James Y Knight, musl; +Cc: Florian Weimer, Rich Felker

[-- Attachment #1: Type: text/plain, Size: 3426 bytes --]

> tcmalloc implements similar functionality, as well, with family of functions named "tcmalloc_size_returning_operator_new"

This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**.

As in our previous example, suppose we do "void* p = malloc(200KB)" and "malloc_usable_size(p)" returns 256KB. For allocators such as tcmalloc, if we then do "realloc(p, 300KB)", it will actually execute "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at that point the application layer often only needs to preserve a small amount of content, such as the first 2KB of data.

Therefore, it would be great if there were additional parameters to tell realloc to execute only malloc(300KB)+memcpy(**2KB**)+free(256KB) when it falls back to malloc+memcpy+free.

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 

From: James Y Knight
Date: 2022-09-21 00:59
To: musl
CC: Florian Weimer; Rich Felker; baiyang
Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
Adding support for something that's already declared as bad
programming practice seems like a step backwards.  Instead, I hope we
find a way to discourage active use of malloc_usable_size more
strongly.  

BTW, if folks aren't aware, there is already work on the C++ side to expose an API which lets you request a heap allocation of _at least_ the given size, which rounds the actual size up in whatever way the allocator likes, and returns the pointer and actual size allocated. With this API, you declare an explicit intent that all of the memory -- up to the returned size -- is valid to use without needing to go back to the allocator to ask for more.

The proposal is still making its way through the standardization process, but hopefully it'll make it into the next version of C++ after C++23.  (Of course, that's not a sure thing until it happens.) Here's the doc, with more rationale/etc:
  https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html

Also, as noted in the doc, jemalloc experimentally implemented this functionality in its non-standard API, via a function it called "smallocx" -- though jemalloc hides the API so it can't be used by default. The API is effectively:
  typedef struct { void *ptr; size_t size; } smallocx_return_t;
  smallocx_return_t smallocx(size_t size, int flags);
https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414
(That's consistent with jemalloc's other non-standard APIs, which stick alignment/etc into a "flags" argument, but probably not suitable for a more-standardized cross-implementation API)

tcmalloc implements similar functionality, as well, with family of functions named "tcmalloc_size_returning_operator_new":
  https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549
which of course also isn't a suitable API to support cross-implementation.

If someone wants to push forward this area, IMO, it would be really great to have an API exposing this functionality designed to be implemented in a common way across libc malloc implementations -- and eventually added to POSIX or C.

[-- Attachment #2: Type: text/html, Size: 6762 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 17:28           ` baiyang
@ 2022-09-20 17:44             ` Siddhesh Poyarekar
  0 siblings, 0 replies; 68+ messages in thread
From: Siddhesh Poyarekar @ 2022-09-20 17:44 UTC (permalink / raw)
  To: baiyang; +Cc: Florian Weimer, Rich Felker, musl

On Tue, Sep 20, 2022 at 1:28 PM baiyang <baiyang@gmail.com> wrote:
>
> > Is there any other valid reason to use
> > malloc_usable_size instead of simply using realloc?
>
> Yes, we use it as one of the parameters to estimate the memory copy cost when realloc degenerates back to malloc+memcpy+free, refer to the previous mails.

I was only pulled into this conversation with the email I responded
to; I am not subscribed to the musl mailing list.  If this estimation
is not for diagnostics then it's really not that different from the
case I'm talking about.  You shouldn't have to do this check and
realloc should do the right thing whenever there's space available in
the chunk.  I can't comment on musl-specific bits because I haven't
taken a close enough look at mallocng to do that.  In general though,
I don't think it makes sense for a C library to support
malloc_usable_size beyond diagnostics.

Sid

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 17:39             ` baiyang
@ 2022-09-20 18:12               ` Quentin Rameau
  2022-09-20 18:19                 ` Rich Felker
  0 siblings, 1 reply; 68+ messages in thread
From: Quentin Rameau @ 2022-09-20 18:12 UTC (permalink / raw)
  To: musl; +Cc: Florian Weimer, Rich Felker

> This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**.
> 
> As in our previous example, suppose we "void* p malloc(200KB)" and "malloc_usable_size(p)" returns 256KB. For allocators such as tcmalloc, if we do "realloc(p, 300KB)", it will actually execute "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at this time, the actual business of the application layer often only needs to copy a small amount of content, such as the first 2KB of data.

So what you're actually trying to do is clearer now.
And this is something that you should do on the “application layer”,
not expect the libc to magically take care of.

If your application already knows it only needs to keep a small amount of data, do this in a wrapper: malloc(300KB), memcpy(2KB), free the old memory.
If, on the other hand, it knows it actually wants to reuse the same data in a larger contiguous block, use realloc.

> Therefore, it would be great if there were additional parameters to inform realloc to just execute malloc(300KB)+memcpy(**2KB**)+free(256KB) when degenerating back to malloc+memcpy+free.

That's what your wrapper would be.
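
A minimal sketch of such a wrapper, assuming the application already knows how many leading bytes it must preserve. The name `realloc_keep` and its parameters are hypothetical and are not part of any libc:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a libc function): replace the buffer `old`
 * with a fresh allocation of `new_size` bytes, preserving only the
 * first `keep` bytes. The caller must guarantee that `keep` does not
 * exceed the size originally requested for `old`.
 * On failure it returns NULL and leaves `old` untouched, like realloc.
 */
void *realloc_keep(void *old, size_t new_size, size_t keep)
{
    assert(keep <= new_size);

    void *fresh = malloc(new_size);
    if (fresh == NULL)
        return NULL;

    if (old != NULL && keep > 0)
        memcpy(fresh, old, keep);   /* copy only what is still needed */

    free(old);                      /* free(NULL) is a no-op */
    return fresh;
}
```

Unlike realloc, this never resizes in place, so it only pays off when `keep` is much smaller than the old allocation.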

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:12               ` Quentin Rameau
@ 2022-09-20 18:19                 ` Rich Felker
  2022-09-20 18:26                   ` Alexander Monakov
  2022-09-21 10:15                   ` [musl] " 王志强
  0 siblings, 2 replies; 68+ messages in thread
From: Rich Felker @ 2022-09-20 18:19 UTC (permalink / raw)
  To: Quentin Rameau; +Cc: musl, Florian Weimer

On Tue, Sep 20, 2022 at 08:12:44PM +0200, Quentin Rameau wrote:
> > This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**.
> > 
> > As in our previous example, suppose we "void* p malloc(200KB)" and
> > "malloc_usable_size(p)" returns 256KB. For allocators such as
> > tcmalloc, if we do "realloc(p, 300KB)", it will actually execute
> > "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at this time,
> > the actual business of the application layer often only needs to
> > copy a small amount of content, such as the first 2KB of data.
> 
> So what you're actually trying to do is more clear now.
> And this is something that you should do on the “application layer”,
> not expect the libc to magically taking care of this.

Exactly. This can be done entirely at the application layer just by
keeping track of the size you allocated. In the above example, the
number 256 kB is a red herring. Yes the
"malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the
"malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful
when you only want to preserve the first 2K, and you can make the
decision that it would be wasteful, and that you instead just want to
allocate a new buffer yourself and memcpy 2K, just by knowing the
original 200KB, without any knowledge of malloc_usable_size.

As a bonus, this will be even faster than the fastest possible
malloc_usable_size implementation.
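
One way to picture "keeping track of the size you allocated" is a small record that stores the requested size next to the pointer. The `struct buf` type, the `buf_regrow` helper and the half-size cut-off below are illustrative assumptions, not anything taken from musl or from the poster's code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical application-side record: pointer plus the size we asked for. */
struct buf {
    void  *data;
    size_t size;   /* bytes requested from malloc, tracked by the caller */
};

/*
 * Grow `b` to `new_size` bytes while preserving the first `keep` bytes.
 * If most of the old contents must survive, plain realloc is used, since
 * it may be able to resize in place. Otherwise a fresh block is allocated
 * and only `keep` bytes are copied.
 * Returns 0 on success, -1 on failure (in which case `b` is unchanged).
 */
int buf_regrow(struct buf *b, size_t new_size, size_t keep)
{
    assert(b != NULL);
    if (new_size == 0 || keep > b->size || keep > new_size)
        return -1;

    void *p;
    if (keep >= b->size / 2) {          /* arbitrary cut-off for the sketch */
        p = realloc(b->data, new_size);
        if (p == NULL)
            return -1;
    } else {
        p = malloc(new_size);
        if (p == NULL)
            return -1;
        if (keep > 0)
            memcpy(p, b->data, keep);
        free(b->data);
    }
    b->data = p;
    b->size = new_size;
    return 0;
}
```

The decision uses only the size the application itself recorded; no call to malloc_usable_size is involved.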

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:19                 ` Rich Felker
@ 2022-09-20 18:26                   ` Alexander Monakov
  2022-09-20 18:35                     ` baiyang
  2022-09-20 18:37                     ` Quentin Rameau
  2022-09-21 10:15                   ` [musl] " 王志强
  1 sibling, 2 replies; 68+ messages in thread
From: Alexander Monakov @ 2022-09-20 18:26 UTC (permalink / raw)
  To: musl; +Cc: Quentin Rameau, Florian Weimer

On Tue, 20 Sep 2022, Rich Felker wrote:

> Exactly. This can be done entirely at the application layer just by
> keeping track of the size you allocated. In the above example, the
> number 256 kB is a red herring. Yes the
> "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the
> "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful
> when you only want to preserve the first 2K, and you can make the
> decision that it would be wasteful, and that you instead just want to
> allocate a new buffer yourself and memcpy 2K, just by knowing the
> original 200KB, without any knowledge of malloc_usable_size.

They want to know if realloc will resize the allocation in-place so
the internal memcpy will not happen.

AIUI, what they really need is not "usable_size", but "cost estimation
for resizing allocation at pointer P to size S". Which I believe they
try to deduce from malloc_usable_size.

Alexander

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:26                   ` Alexander Monakov
@ 2022-09-20 18:35                     ` baiyang
  2022-09-20 20:33                       ` Gabriel Ravier
  2022-09-20 18:37                     ` Quentin Rameau
  1 sibling, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20 18:35 UTC (permalink / raw)
  To: musl; +Cc: Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 2096 bytes --]

> So what you're actually trying to do is more clear now.
> And this is something that you should do on the “application layer”,
> not expect the libc to magically taking care of this.

Yes, we just do this at the application layer with the help of APIs such as malloc_usable_size. You can look at the previous emails, isn't that what we're discussing?

> They want to know if realloc will resize the allocation in-place so
> the internal memcpy will not happen.
> AIUI, what they really need is not "usable_size", but "cost estimation
> for resizing allocation at pointer P to size S". Which I believe they
> try to deduce from malloc_usable_size. 

Yes, from malloc_usable_size, the size that the application layer actually needs to copy, and other parameters like the OS, allocator, etc. 

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 
 
 

[-- Attachment #2: Type: text/html, Size: 4559 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:26                   ` Alexander Monakov
  2022-09-20 18:35                     ` baiyang
@ 2022-09-20 18:37                     ` Quentin Rameau
  1 sibling, 0 replies; 68+ messages in thread
From: Quentin Rameau @ 2022-09-20 18:37 UTC (permalink / raw)
  To: musl

Hi Alexander,

> > Exactly. This can be done entirely at the application layer just by
> > keeping track of the size you allocated. In the above example, the
> > number 256 kB is a red herring. Yes the
> > "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the
> > "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful
> > when you only want to preserve the first 2K, and you can make the
> > decision that it would be wasteful, and that you instead just want to
> > allocate a new buffer yourself and memcpy 2K, just by knowing the
> > original 200KB, without any knowledge of malloc_usable_size.  
> 
> They want to know if realloc will resize the allocation in-place so
> the internal memcpy will not happen.
> 
> AIUI, what they really need is not "usable_size", but "cost estimation
> for resizing allocation at pointer P to size S". Which I believe they
> try to deduce from malloc_usable_size.

I see.

But then I doubt this is something you can reliably expect from any
implementation, as at best, one would just have to assume that the target
allocator works in one way and wouldn't change.
This will result in code tightly bound to a specific allocator
because of the assumptions made about the library internals.

This is not really the kind of contract one makes with the POSIX/C library
in a portable way.

So if the request really is to press every libc implementation into
adopting one user's (or a few) self-made requirements,
that still doesn't sound like a constructive conversation here.

At this point, if one really needs micro-management of their memory,
they should just implement their own allocator.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 17:34             ` Szabolcs Nagy
@ 2022-09-20 19:53               ` James Y Knight
  2022-09-24  8:55               ` Fangrui Song
  1 sibling, 0 replies; 68+ messages in thread
From: James Y Knight @ 2022-09-20 19:53 UTC (permalink / raw)
  To: James Y Knight, musl, Florian Weimer, Rich Felker, baiyang

[-- Attachment #1: Type: text/plain, Size: 2048 bytes --]

On Tue, Sep 20, 2022 at 1:34 PM Szabolcs Nagy <nsz@port70.net> wrote:

> this does not seem to discuss how existing applications
> that override new() would cope with this.

I'm certain this isn't the right forum to discuss the C++-specific parts of
the proposal, nor am I the right person to do so even if it were. I brought
it up to potentially inform the discussion of C library APIs, not to debate
whether the C++ proposal itself is ready (it's clearly not: it didn't make
it into C++23).

> nor how existing implementations on top of c allocators
> would implement it (given that we just agreed that
> malloc_usable_size is not suitable for such use).

As the doc says, implementations should use the trivial implementation
which returns exactly the requested size (e.g. `return {::operator
new(size), size};`), if no better option is available in the underlying
allocator.
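
In C terms, that trivial fallback could look like the sketch below. The `sized_ptr` struct and the `alloc_at_least` name are made up for illustration (loosely modelled on the `smallocx_return_t` shape quoted earlier in the thread) and do not correspond to any existing libc interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result type: the pointer plus the size the caller may use. */
struct sized_ptr {
    void  *ptr;
    size_t size;
};

/*
 * Hypothetical "allocate at least n bytes" call. An allocator with size
 * classes could report the rounded-up size here; this portable fallback
 * knows nothing about the underlying allocator, so it reports exactly
 * the requested size.
 */
struct sized_ptr alloc_at_least(size_t n)
{
    assert(n > 0);      /* malloc(0) is implementation-defined; avoid it */

    struct sized_ptr r = { malloc(n), n };
    if (r.ptr == NULL)
        r.size = 0;     /* nothing usable on failure */
    return r;
}
```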

> nor how existing allocator tooling (interposers, profilers)
> would handle the new interface.

User replacement of the C++ global allocators is supported in the standard,
but interposition of malloc is only an implementation-defined extension. In
practice (and as documented by glibc), a malloc interposer must interpose
upon the complete set of malloc-APIs which are ever used anywhere in the
binary. (Of course, the base malloc implementation CAN harden itself
against incomplete interposition if it wishes -- as musl does to support
interposers that fail to define aligned_alloc.)

> > If someone wants to push forward this area, IMO, it would be really great
> > to have an API exposing this functionality designed to be implemented in a
> > common way across libc malloc implementations -- and eventually added to
> > POSIX or C.
>
> this is done the wrong way around.

Eh. Standardization often goes better when there is already concrete
implementation experience, and some level of agreement, rather than doing
brand new design. That said, I didn't intend to prescribe how someone else
should best go about making this happen, if indeed anyone wishes to do so.

[-- Attachment #2: Type: text/html, Size: 3316 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:35                     ` baiyang
@ 2022-09-20 20:33                       ` Gabriel Ravier
  2022-09-20 20:45                         ` baiyang
  0 siblings, 1 reply; 68+ messages in thread
From: Gabriel Ravier @ 2022-09-20 20:33 UTC (permalink / raw)
  To: musl, baiyang; +Cc: Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]

On 9/20/22 20:35, baiyang wrote:
> > So what you're actually trying to do is more clear now.
> > And this is something that you should do on the “application layer”,
> > not expect the libc to magically taking care of this.
>
> Yes, we just do this at the application layer with the help of APIs 
> such as malloc_usable_size. You can look at the previous emails, isn't 
> that what we're discussing?
>
> > They want to know if realloc will resize the allocation in-place so
> > the internal memcpy will not happen.
> > AIUI, what they really need is not "usable_size", but "cost estimation
> > for resizing allocation at pointer P to size S". Which I believe they
> > try to deduce from malloc_usable_size.
>
> Yes, from malloc_usable_size, the size that the application layer 
> actually needs to copy, and other parameters like the OS, allocator, etc.
>
At this point it feels more and more like malloc basically can't fulfill 
your needs unless every implementation out there starts adding 
specialized methods for your application. In other words, it might be 
more productive for you to simply bring in a more agreeable allocator, 
instead of trying to get every single allocator out there to bow down to 
your needs.

[-- Attachment #2: Type: text/html, Size: 2780 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 20:33                       ` Gabriel Ravier
@ 2022-09-20 20:45                         ` baiyang
  2022-09-21  8:42                           ` NRK
  0 siblings, 1 reply; 68+ messages in thread
From: baiyang @ 2022-09-20 20:45 UTC (permalink / raw)
  To: Gabriel Ravier, musl; +Cc: Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 2390 bytes --]

> At this point it feels more and more like malloc basically can't fulfill your needs unless every implementation out there starts adding specialized methods for your application. In other words, it might be more productive for you to simply bring in a more agreeable allocator, instead of trying to get every single allocator out there to bow down to your needs.

Yes, we will adapt as many allocators as possible to make our software more portable and adaptable.

Frankly, so far, we've only encountered malloc_usable_size performance issues with musl's mallocng (and, as we discussed earlier, this performance issue affects both realloc and free). 

Of course, we also respect and understand the author's considerations and trade-offs, and have avoided calling malloc_usable_size for musl's mallocng in our software.

Thanks :-)

--

   Best Regards
  BaiYang
  baiyang@gmail.com
  http://i.baiy.cn
**** < END OF EMAIL > **** 


[-- Attachment #2: Type: text/html, Size: 5654 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 20:45                         ` baiyang
@ 2022-09-21  8:42                           ` NRK
  0 siblings, 0 replies; 68+ messages in thread
From: NRK @ 2022-09-21  8:42 UTC (permalink / raw)
  To: musl

On Wed, Sep 21, 2022 at 04:45:59AM +0800, baiyang wrote:
> Yes, we will adapt as many allocators as possible to make our software
> more portable and adaptable.
>
> Of course, we also respect and understand the author's considerations
> and trade-offs, and have avoided calling malloc_usable_size for musl's
> mallocng in our software.

Here's a better idea: malloc (and malloc_usable_size) cannot be slow if
you never call it, so just skip it altogether and grab the needed
pages straight from the kernel (mmap(2) on Linux/POSIX; on Windows I
think it's VirtualAlloc).
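
On Linux/POSIX, a bare-bones sketch of that approach might look as follows. The names `pages_alloc` and `pages_free` are hypothetical; a real application would more likely map one large region up front and carve it up itself than call this once per object:

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc and musl */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of zero-filled anonymous memory directly from the
 * kernel. Returns NULL on failure. The kernel rounds the mapping up to
 * whole pages, and the caller must remember `len` for the later unmap.
 */
void *pages_alloc(size_t len)
{
    assert(len > 0);    /* mmap rejects zero-length mappings */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Return a mapping obtained from pages_alloc to the kernel. */
int pages_free(void *p, size_t len)
{
    return munmap(p, len);
}
```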

The malloc API was designed for ease of use, not performance. This is
evident from the fact that the user has no control over malloc's metadata.

And what you're doing with malloc_usable_size() is not an
"optimization"; it's a heuristic at best and a hack that'll actually
*pessimize* at worst. Proof? This very thread.

Also, malloc_usable_size() does not tell you anything about whether
realloc will grow in place (or avoid the copy entirely via virtual address
space shenanigans).

If you need such fine grained control then you're using the wrong API,
you either need to manage your memory directly or find an allocator
library which *actually* (as in documented in the API contract) offers
such control rather than relying on unreliable heuristics.

- NRK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 18:19                 ` Rich Felker
  2022-09-20 18:26                   ` Alexander Monakov
@ 2022-09-21 10:15                   ` 王志强
  2022-09-21 16:11                     ` [musl] " 王志强
  2022-09-21 17:15                     ` [musl] " Rich Felker
  1 sibling, 2 replies; 68+ messages in thread
From: 王志强 @ 2022-09-21 10:15 UTC (permalink / raw)
  To: musl; +Cc: Quentin Rameau, Florian Weimer, dalias

[-- Attachment #1: Type: text/plain, Size: 2549 bytes --]

Hi Rich,

I am quite interested in the topic, and made a comparison between glibc and musl with the following code:
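```c
#include <stdlib.h>

#define MAXF 4096
void* tobefree[MAXF];
int main() {
    long long i;
    int v, k;
    size_t s, c=0;
    char *p;
    for (i=0; i<100000000L; i++) {
        v = rand();
        s = ((v%256)+1)*1024;   /* random size from 1KB to 256KB */
        p = (char*) malloc(s);
        if (!p) return 1;
        p[1023]=0;              /* write one byte into the new block */
        if (c>=MAXF) {          /* table full: free one random entry */
            k = v%c;
            free(tobefree[k]);
            tobefree[k]=tobefree[--c];
        }
        tobefree[c++]=p;
    }
    return 0;
}
```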
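
As a rough aid for reading the results, the request sizes this loop generates can be summarized with a short helper. This is only a sketch: `bench_size_stats` and its `cutoff` parameter are hypothetical, it assumes `v % 256` is uniformly distributed, and the cut-off passed in is whatever size boundary one wants to examine rather than a value taken from either allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the request sizes produced by ((v % 256) + 1) * 1024. */
struct size_stats {
    size_t min, max;     /* smallest and largest request, in bytes */
    double mean;         /* mean request size, in bytes */
    int at_or_above;     /* how many of the 256 sizes reach the cut-off */
};

struct size_stats bench_size_stats(size_t cutoff)
{
    assert(cutoff > 0);

    struct size_stats st = { (size_t)-1, 0, 0.0, 0 };
    double sum = 0.0;

    for (int r = 0; r < 256; r++) {       /* every possible value of v % 256 */
        size_t s = ((size_t)r + 1) * 1024;
        if (s < st.min) st.min = s;
        if (s > st.max) st.max = s;
        if (s >= cutoff) st.at_or_above++;
        sum += (double)s;
    }
    st.mean = sum / 256.0;
    return st;
}
```

With a cut-off of 128 KiB, the sizes range from 1 KiB to 256 KiB, the mean is 131584 bytes (128.5 KiB), and 129 of the 256 possible sizes are at or above the cut-off.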

The results show a significant difference.
With glibc (running within a Debian docker image):
# gcc -o m.debian -O0 app_malloc.c

# time ./m.debian
real    0m37.529s
user    0m36.677s
sys    0m0.771s

With musl (running within an alpine3.15 docker image):

# gcc -o m.alpine -O0 app_malloc.c

# time ./m.alpine
real    6m 30.51s
user    1m 36.67s
sys    4m 53.31s

musl seems to spend far too much time in the kernel, while glibc does most of its work in userspace.
I used perf_event_open to profile those programs:
The musl profile (302899 samples in total) shows that the malloc/free sequence spends a lot of time on page faults, munmap, madvise and mmap:

munmap(30.858% 93469/302899)
_init?(22.583% 68404/302899)
aligned_alloc?(89.290% 61078/68404)
asm_exc_page_fault(45.961% 28072/61078)
main(9.001% 6157/68404)
asm_exc_page_fault(29.170% 1796/6157)
rand(1.266% 866/68404)
aligned_alloc?(20.437% 61904/302899)
asm_exc_page_fault(56.038% 34690/61904)
madvise(13.275% 40209/302899)
mmap64(11.125% 33698/302899)

But the glibc profile (29072 samples in total) is much lighter; page faults are the largest cost, while glibc spends significant time in "free":

pthread_attr_setschedparam?(82.021% 23845/29072)
asm_exc_page_fault(1.657% 395/23845)
_dl_catch_error?(16.714% 4859/29072)
__libc_start_main(100.000% 4859/4859)
cfree(58.839% 2859/4859)
main(31.138% 1513/4859)
asm_exc_page_fault(2.115% 32/1513)
pthread_attr_setschedparam?(3.725% 181/4859)
random(2.099% 102/4859)
random_r(1.832% 89/4859)
__libc_malloc(1.420% 69/4859)
It seems to me that glibc caches a lot of memory obtained from the kernel and so avoids many page faults and syscalls.
Should this performance difference concern real-world applications? On average, musl spends about 3~4 µs per malloc/free pair in this test (390 s over 10^8 iterations), which I think is quite acceptable for real-world applications.

(It seems to me that the performance difference has nothing to do with malloc_usable_size, which may indeed be just a speculative guess without any basis.)
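
The per-iteration figure can be checked directly from the `time` output above. The helper below is only an illustrative calculation; `ns_per_iteration` is a made-up name:

```c
#include <assert.h>

/*
 * Average wall-clock cost of one loop iteration, in nanoseconds, given
 * the total elapsed time in seconds and the number of iterations.
 */
double ns_per_iteration(double elapsed_seconds, double iterations)
{
    assert(iterations > 0.0);
    return elapsed_seconds * 1e9 / iterations;
}
```

For the musl run (6m30.51s of real time over 10^8 iterations) this comes to roughly 3.9 microseconds per iteration, and for the glibc run (37.529s) to roughly 375 nanoseconds. Each iteration covers one rand(), one malloc and, once the table is full, one free.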

David Wang

[-- Attachment #2: Type: text/html, Size: 6435 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [musl] Re:[musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-21 10:15                   ` [musl] " 王志强
@ 2022-09-21 16:11                     ` 王志强
  2022-09-21 17:15                     ` [musl] " Rich Felker
  1 sibling, 0 replies; 68+ messages in thread
From: 王志强 @ 2022-09-21 16:11 UTC (permalink / raw)
  To: musl; +Cc: Quentin Rameau, Florian Weimer, dalias

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]




At 2022-09-21 18:15:02, "王志强" <00107082@163.com> wrote:



But the glibc profile (29072 samples in total) is much lighter; page faults are the largest cost, while glibc spends significant time in "free":



pthread_attr_setschedparam?(82.021% 23845/29072)
asm_exc_page_fault(1.657% 395/23845)
_dl_catch_error?(16.714% 4859/29072)
__libc_start_main(100.000% 4859/4859)
cfree(58.839% 2859/4859)
main(31.138% 1513/4859)
asm_exc_page_fault(2.115% 32/1513)
pthread_attr_setschedparam?(3.725% 181/4859)
random(2.099% 102/4859)
random_r(1.832% 89/4859)
__libc_malloc(1.420% 69/4859)


Update: glibc spends most of its time within malloc (82% of samples), not in the page-fault routine (only about 1%). I misread the report in my last mail.



__libc_malloc?(82.195% 20700/25184)
asm_exc_page_fault(1.589% 329/20700)
_dl_catch_error??(16.717% 4210/25184)
__libc_start_main(100.000% 4210/4210)
cfree(56.508% 2379/4210)
main(32.660% 1375/4210)
asm_exc_page_fault(2.327% 32/1375)
__libc_malloc?(4.418% 186/4210)
random(2.375% 100/4210)
random_r(1.805% 76/4210)
__libc_malloc(1.116% 47/4210)
David

[-- Attachment #2: Type: text/html, Size: 3852 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-21 10:15                   ` [musl] " 王志强
  2022-09-21 16:11                     ` [musl] " 王志强
@ 2022-09-21 17:15                     ` Rich Felker
  2022-09-21 17:58                       ` Rich Felker
  1 sibling, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-21 17:15 UTC (permalink / raw)
  To: 王志强; +Cc: musl, Quentin Rameau, Florian Weimer

On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote:
> Hi Rich,
> 
> 
> 
> I am quite interested in the topic, and made a comparison between glibc and musl with the following code:
> ```c
> #include <stdlib.h>
>
> #define MAXF 4096
> void* tobefree[MAXF];
> int main() {
>     long long i;
>     int v, k;
>     size_t s, c=0;
>     char *p;
>     for (i=0; i<100000000L; i++) {
>         v = rand();
>         s = ((v%256)+1)*1024;   /* random size from 1KB to 256KB */
>         p = (char*) malloc(s);
>         if (!p) return 1;
>         p[1023]=0;              /* write one byte into the new block */
>         if (c>=MAXF) {          /* table full: free one random entry */
>             k = v%c;
>             free(tobefree[k]);
>             tobefree[k]=tobefree[--c];
>         }
>         tobefree[c++]=p;
>     }
>     return 0;
> }
> ```
> 
> The results show a significant difference.
> With glibc (running within a Debian docker image):
> # gcc -o m.debian -O0 app_malloc.c
> 
> # time ./m.debian
> real    0m37.529s
> user    0m36.677s
> sys    0m0.771s
> 
> With musl (running within an alpine3.15 docker image):
> 
> # gcc -o m.alpine -O0 app_malloc.c
> 
> # time ./m.alpine
> real    6m 30.51s
> user    1m 36.67s
> sys    4m 53.31s
> 
> 
> 
> musl seems to spend far too much time in the kernel, while glibc does most of its work in userspace.
> I used perf_event_open to profile those programs:
> The musl profile (302899 samples in total) shows that the malloc/free sequence spends a lot of time on page faults, munmap, madvise and mmap:
> 
> munmap(30.858% 93469/302899)
> _init?(22.583% 68404/302899)
> aligned_alloc?(89.290% 61078/68404)
> asm_exc_page_fault(45.961% 28072/61078)
> main(9.001% 6157/68404)
> asm_exc_page_fault(29.170% 1796/6157)
> rand(1.266% 866/68404)
> aligned_alloc?(20.437% 61904/302899)
> asm_exc_page_fault(56.038% 34690/61904)
> madvise(13.275% 40209/302899)
> mmap64(11.125% 33698/302899)
> 
> 
> The glibc profile (29072 samples in total) is much lighter; page faults are the largest cost, while glibc spends significant time in "free":
> 
> 
> 
> pthread_attr_setschedparam?(82.021% 23845/29072)
> asm_exc_page_fault(1.657% 395/23845)
> _dl_catch_error?(16.714% 4859/29072)
> __libc_start_main(100.000% 4859/4859)
> cfree(58.839% 2859/4859)
> main(31.138% 1513/4859)
> asm_exc_page_fault(2.115% 32/1513)
> pthread_attr_setschedparam?(3.725% 181/4859)
> random(2.099% 102/4859)
> random_r(1.832% 89/4859)
> __libc_malloc(1.420% 69/4859)
> It seems to me that glibc makes heavy use of caching of memory obtained
> from the kernel, and so avoids many page faults and syscalls.
> Should this performance difference concern real-world
> applications? On average, musl actually spends about 3~4 microseconds per
> malloc/free (390 seconds over 100 million iterations), which is quite
> acceptable in real-world applications, I think.
> 
> (It seems to me that the performance difference has nothing to do with
> malloc_usable_size; blaming it may indeed be just a speculative guess
> without any basis.)

Indeed this has nothing to do with it. What you're seeing is just that
musl/mallocng return freed memory, and glibc, basically, doesn't
(modulo the special case of large contiguous free block at 'top' of
heap). This inherently has a time cost.

mallocng does make significant efforts to avoid hammering mmap/munmap
under repeated malloc/free, at least in cases where it can reasonably
be deemed to matter. However, this is best-effort, and always a
tradeoff on (potential) large unwanted memory usage vs performance.
More on this later.

Your test case, with the completely random size distribution across
various large sizes, is likely a worst case. The mean size you're
allocating is 128k, which is the threshold for direct mmap/munmap of
each allocation, so at least half of the allocations you're making can
*never* be reused, and will always be immediately unmapped on free. It
might be interesting to change the scaling factor from 1k to 256 bytes
so that basically all of the allocation sizes are in the
malloc-managed range.
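
The suggested change of scaling factor can be tried with a variant of
the test where the factor is a parameter. The following is only a
sketch, not code from this thread: the function name run_churn and its
return value are made up for illustration, and it touches the first
byte of each block rather than p[1023], since with a 256-byte factor
the smallest block is only 256 bytes long.

```c
#include <stdlib.h>

#define MAXF 4096

/* Pool of live allocations, as in the original test. */
static void *tobefree[MAXF];

/* Free the first c entries of the pool. */
static void drain_pool(size_t c)
{
    while (c > 0)
        free(tobefree[--c]);
}

/*
 * Run the malloc/free churn loop with sizes drawn from
 * scale*1 .. scale*256 bytes (scale must be nonzero).
 * Returns the number of blocks still live when the loop ended
 * (they are freed before returning), or (size_t)-1 if malloc failed.
 */
size_t run_churn(size_t iterations, size_t scale)
{
    size_t c = 0;

    for (size_t i = 0; i < iterations; i++) {
        int v = rand();
        size_t s = ((size_t)(v % 256) + 1) * scale;
        char *p = malloc(s);

        if (!p) {
            drain_pool(c);
            return (size_t)-1;
        }
        p[0] = 0;   /* touch the block so its first page is really used */

        if (c >= MAXF) {
            /* pool full: free a random entry, move the last one into its slot */
            size_t k = (size_t)v % c;
            free(tobefree[k]);
            tobefree[k] = tobefree[--c];
        }
        tobefree[c++] = p;
    }

    size_t live = c;
    drain_pool(c);
    return live;
}
```

Calling run_churn(100000000, 1024) should roughly correspond to the
original test, and run_churn(100000000, 256) to the suggested variant
with sizes from 256 bytes to 64k.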

Now, I mentioned above "when it can reasonably be deemed to matter".
One of the assumptions mallocng makes for deciding where to risk
larger memory usage for the sake of performance is that you're
*actually going to use* the memory you're allocating. Yes, 6min might
seem like a lot of time to do nothing, but that "doing nothing" was
churning through allocating 12 TB of memory. How much time would you
have spent *actually processing* 12 TB of data (and not as flat linear
data, but avg-128kB chunks), and what percentage of that time would
the 6min in malloc have been?
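
The 12 TB figure can be checked with a little arithmetic. This is a
sketch with hypothetical helper names, assuming rand() % 256 is close
to uniform so that the expected size is the midpoint of the range:

```c
#include <stdint.h>

/* Expected size when sizes scale*1 .. scale*256 are equally likely. */
uint64_t mean_alloc_bytes(uint64_t scale)
{
    return scale * (1 + 256) / 2;
}

/* Expected total number of bytes requested over all iterations. */
uint64_t total_alloc_bytes(uint64_t iterations, uint64_t scale)
{
    return iterations * mean_alloc_bytes(scale);
}

/* Average wall-clock cost of one loop iteration, in nanoseconds. */
double ns_per_iteration(double seconds, uint64_t iterations)
{
    return seconds * 1e9 / (double)iterations;
}
```

With a scale of 1024 and 100 million iterations this gives
13158400000000 bytes, just under 12 TiB, and the reported 6m30.51s
(390.51 seconds) of wall time works out to about 3.9 microseconds per
iteration.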

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-21 17:15                     ` [musl] " Rich Felker
@ 2022-09-21 17:58                       ` Rich Felker
  2022-09-22  3:34                         ` [musl] " 王志强
  0 siblings, 1 reply; 68+ messages in thread
From: Rich Felker @ 2022-09-21 17:58 UTC (permalink / raw)
  To: 王志强; +Cc: musl, Quentin Rameau, Florian Weimer

On Wed, Sep 21, 2022 at 01:15:35PM -0400, Rich Felker wrote:
> Indeed this has nothing to do with it. What you're seeing is just that
> musl/mallocng return freed memory, and glibc, basically, doesn't
> (modulo the special case of large contiguous free block at 'top' of
> heap). This inherently has a time cost.
> 
> mallocng does make significant efforts to avoid hammering mmap/munmap
> under repeated malloc/free, at least in cases where it can reasonably
> be deemed to matter. However, this is best-effort, and always a
> tradeoff on (potential) large unwanted memory usage vs performance.
> More on this later.
> 
> Your test case, with the completely random size distribution across
> various large sizes, is likely a worst case. The mean size you're
> allocating is 128k, which is the threshold for direct mmap/munmap of
> each allocation, so at least half of the allocations you're making can
> *never* be reused, and will always be immediately unmapped on free. It
> might be interesting to change the scaling factor from 1k to 256 bytes
> so that basically all of the allocation sizes are in the
> malloc-managed range.

One observation if this change is made: it looks like at least 70% of
the time is spent performing madvise(MADV_FREE), and that a large
portion of the rest (just looking at strace) seems to be repeatedly
mapping and freeing a 17-page (68k) block, probably because this size
happens to be at the boundary of some threshold where bounce
protection isn't happening. I think we should look at both of these in
more detail, since they both suggest opportunities for large
performance improvements at low cost.

Rich

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-21 17:58                       ` Rich Felker
@ 2022-09-22  3:34                         ` 王志强
  2022-09-22  9:10                           ` [musl] " 王志强
  0 siblings, 1 reply; 68+ messages in thread
From: 王志强 @ 2022-09-22  3:34 UTC (permalink / raw)
  To: musl, dalias; +Cc: Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 3561 bytes --]

Hi Rich,

Thanks for your time.

Totally agreed that in a real-world application it would take far more time to process that much memory than the average of 3 microseconds per malloc and free.

At 2022-09-22 01:58:17, "Rich Felker" <dalias@libc.org> wrote:
>> Your test case, with the completely random size distribution across
>> various large sizes, is likely a worst case. The mean size you're
>> allocating is 128k, which is the threshold for direct mmap/munmap of
>> each allocation, so at least half of the allocations you're making can
>> *never* be reused, and will always be immediately unmapped on free. It
>> might be interesting to change the scaling factor from 1k to 256 bytes
>> so that basically all of the allocation sizes are in the
>> malloc-managed range.
>
>One observation if this change is made: it looks like at least 70% of
>the time is spent performing madvise(MADV_FREE), and that a large
>portion of the rest (just looking at strace) seems to be repeatedly
>mapping and freeing a 17-page (68k) block, probably because this size
>happens to be at the boundary of some threshold where bounce
>protection isn't happening. I think we should look at both of these in
>more detail, since they both suggest opportunities for large
>performance improvements at low cost.

I ran several more profiles. The reports indeed show that as the allocation size decreases, performance improves significantly and madvise takes the major portion of the time, as you suggested. madvise's share also decreases as the size decreases: once the average size reaches 2K, madvise accounts for less than 2% of the profiler's samples:

1. average 64K (1K~128K) malloc/free
# time ./m.alpine
real    1m 50.12s
user    0m 39.80s
sys    1m 10.17s

madvise(61.945% 52926/85440)
__libc_start_main?(22.158% 18932/85440)
malloc_usable_size?(82.870% 15689/18932)
asm_exc_page_fault(2.766% 434/15689)
main(16.781% 3177/18932)
asm_exc_page_fault(2.487% 79/3177)
malloc_usable_size?(10.969% 9372/85440)
asm_exc_page_fault(6.519% 611/9372)
munmap(2.449% 2092/85440)
exit?(1.540% 1316/85440)

2. average 32K (1K~64K) malloc/free
# time ./m.alpine
real    1m 12.89s
user    0m 30.62s
sys    0m 41.91s

madvise(60.835% 34282/56352)
__libc_start_main?(27.410% 15446/56352)
malloc_usable_size?(78.558% 12134/15446)
main(20.996% 3243/15446)
malloc_usable_size?(9.354% 5271/56352)
exit?(1.888% 1064/56352)


3. average 8K (1K~16K)

# time ./m.alpine
real    0m 42.35s
user    0m 22.94s
sys    0m 19.27s

madvise(49.338% 16169/32772)
__libc_start_main?(36.592% 11992/32772)
malloc_usable_size?(79.244% 9503/11992)
main(20.480% 2456/11992)
malloc_usable_size?(10.921% 3579/32772)
exit?(2.591% 849/32772)

4. average 4K (1K~8K)
# time ./m.debian
real    0m32.477s
user    0m31.829s
sys    0m0.596s

__libc_start_main?(44.474% 9279/20864)
malloc_usable_size?(81.410% 7554/9279)
main(17.987% 1669/9279)
madvise(37.720% 7870/20864)
malloc_usable_size?(13.986% 2918/20864)
exit?(3.350% 699/20864)

5. average 2K (128B~4096B) (madvise only about 1.7%)

# time ./m.alpine
real    0m 13.02s
user    0m 12.68s
sys    0m 0.26s

__libc_start_main?(69.538% 6974/10029)
malloc_usable_size?(80.786% 5634/6974)
main(18.569% 1295/6974)
malloc_usable_size?(21.538% 2160/10029)
exit?(7.060% 708/10029)
madvise(1.715% 172/10029)

6. average 1K (128B~2048B)

# time ./m.alpine
real    0m 10.75s
user    0m 10.68s
sys    0m 0.01s

__libc_start_main?(72.495% 6012/8293)
malloc_usable_size?(76.630% 4607/6012)
main(22.904% 1377/6012)
malloc_usable_size?(18.823% 1561/8293)
exit?(8.610% 714/8293)




David

[-- Attachment #2: Type: text/html, Size: 8386 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-22  3:34                         ` [musl] " 王志强
@ 2022-09-22  9:10                           ` 王志强
  2022-09-22  9:39                             ` [musl] " 王志强
  0 siblings, 1 reply; 68+ messages in thread
From: 王志强 @ 2022-09-22  9:10 UTC (permalink / raw)
  To: musl; +Cc: dalias, Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 5613 bytes --]

Update:
I changed the test code to call malloc_usable_size, just to verify the impact of this function call:

#include <stdlib.h>
#include <malloc.h>     /* malloc_usable_size */

#define MAXF 4096
void* tobefree[MAXF];
int main() {
    long long i;
    int v, k, j;
    size_t s, c=0;
    char *p;
    for (i=0; i<100000000L; i++) {
        v = rand();
        s = ((v%126)+1)*1024;       /* 1K..126K in 1K steps */
        p = (char*) malloc(s);
        for (j=0; j+1<s; j+=1024) p[j+1]=j;  /* poke pages */
        s = malloc_usable_size(p);  /* added here */
        if (c>=MAXF) {
            k = v%c;
            free(tobefree[k]);
            tobefree[k]=tobefree[--c];
        }
        tobefree[c++]=p;
    }
    return 0;
}
# time ./m.alpine
real    7m 19.07s
user    4m 53.61s
sys    2m 25.22s
It took about 440 seconds to finish all 100 million iterations, an average of 4.4 microseconds per malloc/free pair; with glibc it took 96 seconds, an average of about 1 microsecond. The profiling report is as follows:
_init__libc_start_main(59.374% 202450/340976)
main(79.104% 160147/202450)
aligned_allocmalloc_usable_size(20.098% 40689/202450)
madvise(33.323% 113623/340976)
aligned_allocmalloc_usable_size(6.534% 22279/340976)


aligned_allocmalloc_usable_size accounts for about (40689+22279)/340976 = 18.5% of the samples. Should this raise any concern? (Still, the madvise calls have the major impact, at 33.3%.)

(This profiling report differs from the previous ones because I added code to touch several memory addresses after each malloc.)

Here is the profiler report for glibc running the same code; malloc_usable_size accounts for only about 1% of the 75568 samples in total:

_dl_catch_error??(62.487% 47220/75568)
__libc_start_main(100.000% 47220/47220)
main(80.064% 37806/47220)
cfree(17.274% 8157/47220)
malloc_usable_size(1.355% 640/47220)
pthread_attr_setschedparam?__libc_malloc(36.795% 27805/75568)


My profiler code is at https://github.com/zq-david-wang/linux-tools/tree/main/perf/profiler. It takes a pid as its parameter and profiles all pids within the same perf_event cgroup.


[-- Attachment #2: Type: text/html, Size: 13041 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [musl] Re:[musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-22  9:10                           ` [musl] " 王志强
@ 2022-09-22  9:39                             ` 王志强
  0 siblings, 0 replies; 68+ messages in thread
From: 王志强 @ 2022-09-22  9:39 UTC (permalink / raw)
  To: musl; +Cc: dalias, Quentin Rameau, Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 6389 bytes --]




Sorry, I misread the profiling report in my last mail about malloc_usable_size. I think malloc_usable_size indeed has no impact.

The symbol "aligned_allocmalloc_usable_size" should read aligned_alloc<?>malloc_usable_size. It denotes a function address that could not be symbolized: the address belongs to neither aligned_alloc nor malloc_usable_size but lies between them, and I am not sure which function it belongs to (most likely some static function). The older profiling reports, which did not call malloc_usable_size, also showed a significant number of samples at addresses in that range, so these samples cannot be attributed to malloc_usable_size.

Please ignore my last mail about malloc_usable_size. Sorry.
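
To illustrate how such a label can arise, here is a minimal sketch of address-to-symbol lookup over a sorted table of exported symbols. It is not the code of the profiler linked earlier in this thread; struct sym, resolve_pc and the idea of a size field per symbol are assumptions made for the example. An address that falls past the end of the preceding symbol and before the start of the next one (for instance inside a static function with no exported symbol) cannot be named, only bracketed by its two neighbours.

```c
#include <stddef.h>
#include <stdint.h>

/* One exported symbol: start address, length in bytes, name. */
struct sym {
    uintptr_t addr;
    size_t size;
    const char *name;
};

/*
 * Look up pc in a table sorted by ascending addr.
 * *prev is set to the last symbol starting at or below pc
 * (NULL if pc lies below the first symbol).
 * Returns 1 if pc lies inside *prev, 0 if it falls in a gap.
 */
int resolve_pc(const struct sym *tab, size_t n, uintptr_t pc,
               const struct sym **prev)
{
    size_t lo = 0, hi = n;

    /* binary search for the number of symbols with addr <= pc */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].addr <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }

    *prev = lo ? &tab[lo - 1] : NULL;
    return *prev && pc - (*prev)->addr < (*prev)->size;
}
```

When resolve_pc returns 0 with a non-NULL *prev, a report can only print something like "prev<?>next", which is how samples from an unnamed function end up looking as if they belonged to its neighbours.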

[-- Attachment #2: Type: text/html, Size: 14461 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 17:34             ` Szabolcs Nagy
  2022-09-20 19:53               ` James Y Knight
@ 2022-09-24  8:55               ` Fangrui Song
  1 sibling, 0 replies; 68+ messages in thread
From: Fangrui Song @ 2022-09-24  8:55 UTC (permalink / raw)
  To: James Y Knight, musl, Florian Weimer, Rich Felker, baiyang

On 2022-09-20, Szabolcs Nagy wrote:
>* James Y Knight <jyknight@google.com> [2022-09-20 12:59:00 -0400]:
>
>> On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com>
>> wrote:
>>
>> > Adding support for something that's already declared as bad
>> > programming practice seems like a step backwards.  Instead, I hope we
>> > find a way to discourage active use of malloc_usable_size more
>> > strongly.
>>
>>
>> BTW, if folks aren't aware, there is already work on the C++ side to expose
>> an API which lets you request a heap allocation of _at least_ the given
>> size, which rounds the actual size up in whatever way the allocator likes,
>> and returns the pointer and actual size allocated. With this API, you
>> declare an explicit intent that all of the memory -- up to the returned
>> size -- is valid to use without needing to go back to the allocator to ask
>> for more.
>>
>> The proposal is still making its way through the standardization process,
>> but hopefully it'll make it into the next version of C++ after C++23.  (Of
>> course, that's not a sure thing until it happens.) Here's the doc, with
>> more rationale/etc:
>>   https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html
>
>this does not seem to discuss how existing applications
>that override new() would cope with this.
>
>nor how existing implementations on top of c allocators
>would implement it (given that we just agreed that
>malloc_usable_size is not suitable for such use).
>
>nor how existing allocator tooling (interposers, profilers)
>would handle the new interface.
>
>>
>> Also, as noted in the doc, jemalloc experimentally implemented this
>> functionality in its non-standard API, via a function it called "smallocx"
>> -- though jemalloc hides the API so it can't be used by default. The API is
>> effectively:
>>   typedef struct { void *ptr; size_t size; } smallocx_return_t;
>>   smallocx_return_t smallocx(size_t size, int flags);
>> https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414
>> (That's consistent with jemalloc's other non-standard APIs, which stick
>> alignment/etc into a "flags" argument, but probably not suitable for a
>> more-standardized cross-implementation API)
>>
>> tcmalloc implements similar functionality, as well, with family of
>> functions named "tcmalloc_size_returning_operator_new":
>
>so there are already incompatible c apis, which means this
>should not be considered a viable proposal at this point.

Small addition: https://wg21.link/P0401R6 (allocate_at_least) has made it into C++23.
libc++ has implemented it (https://reviews.llvm.org/D122877) in the
trivial way that just returns the user-requested size.

>> https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549
>> which of course also isn't a suitable API to support cross-implementation.
>>
>> If someone wants to push forward this area, IMO, it would be really great
>> to have an API exposing this functionality designed to be implemented in a
>> common way across libc malloc implementations -- and eventually added to
>> POSIX or C.
>
>this is done the wrong way around.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
  2022-09-20 13:54         ` Siddhesh Poyarekar
  2022-09-20 16:59           ` James Y Knight
  2022-09-20 17:28           ` baiyang
@ 2022-10-10 14:13           ` Florian Weimer
  2 siblings, 0 replies; 68+ messages in thread
From: Florian Weimer @ 2022-10-10 14:13 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: Rich Felker, baiyang, musl

* Siddhesh Poyarekar:

> On Tue, Sep 20, 2022 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
>> The compiler needs to treat malloc_usable_size similar to realloc and
>> adjust the size information for the buffer based on the return value from
>> malloc_usable_size.  This is admittedly harder to do than a comparable
>> analysis for realloc if the compiler interprets the standard in such a
>> way that after a successful realloc, any access to the original pointer
>> value is undefined.
>>
>> malloc_usable_size is not actually *that* useful with allocators that do
>> not have strict size classes because they do not over-allocate that
>> much.  For these allocators, it may be possible to increase the size of
>> allocation significantly without moving it, but that is not reflected in
>> the return value of malloc_usable_size at all.
>
> So the glibc manual does not document malloc_usable_size semantics
> (which is weird since it is, well, a GNU extension!)

I think we got it via dlmalloc, which says this:

/*
  malloc_usable_size(void* p);

  Returns the number of bytes you can actually use in
  an allocated chunk, which may be more than you requested (although
  often not) due to alignment and minimum size constraints.
  You can use this many bytes without worrying about
  overwriting other allocated objects. This is not a particularly great
  programming practice. malloc_usable_size can be more useful in
  debugging and assertions, for example:

  p = malloc(n);
  assert(malloc_usable_size(p) >= 256);
*/
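
For illustration, the debugging-style use described in that comment
might look like the sketch below. The helper name usable_after_malloc
is hypothetical; malloc_usable_size itself is the non-standard function
declared in <malloc.h> by glibc and musl.

```c
#include <malloc.h>   /* malloc_usable_size (non-standard) */
#include <stdlib.h>

/*
 * Allocate n bytes, ask the allocator how many bytes of the block are
 * usable, then free the block again.
 * Returns the usable size, or 0 if malloc failed.
 */
size_t usable_after_malloc(size_t n)
{
    void *p = malloc(n);
    if (!p)
        return 0;

    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

The only property a caller can rely on is that the result is at least
the requested size; how much larger it is, if at all, depends on the
allocator.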

I don't think it's a GNU invention.  The GNU malloc used to be Mike
Haertel's malloc, as far as I can tell, and that didn't have the
function.  GNU malloc had a malloc_object_allocation_size function at
one point, it seems, but I don't know if that was before it was a
rebranded dlmalloc.  The malloc subsystem wasn't part of the glibc CVS
at the time.

> Adding support for something that's already declared as bad
> programming practice seems like a step backwards.  Instead, I hope we
> find a way to discourage active use of malloc_usable_size more
> strongly.  At least based on the systemd experience, the problem they
> try to solve is that of glibc realloc being too slow for paths where
> the reallocation should return the same block and that should be easy
> to special-case.  Is there any other valid reason to use
> malloc_usable_size instead of simply using realloc?

Do you know which case systemd tries to optimize?  Increasing the
allocation size or lowering it?

Maybe we should define a non-copying version of realloc if that's
what programmers want.
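
To make the idea concrete, here is a rough sketch of the semantics such an interface might have. Everything in it is an assumption for illustration: try_resize_in_place is a hypothetical name, not an existing glibc, musl, or jemalloc function, and it is emulated here on top of malloc_usable_size only to show the contract (succeed without moving the block, or fail and leave it untouched). A real version would have to live inside the allocator, where it could also grow a block in place.

```c
#include <assert.h>
#include <malloc.h>   /* malloc_usable_size: non-standard (glibc, musl) */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical non-copying resize: returns true if the block at p can
 * serve n bytes without being moved, false otherwise. The block is never
 * copied or freed. This emulation can only say "yes" when n already fits
 * in the size the allocator reports; it cannot extend the block.
 */
bool try_resize_in_place(void *p, size_t n)
{
    if (!p)
        return false;
    return n <= malloc_usable_size(p);
}

/* Small self-check showing the intended contract. */
void try_resize_in_place_demo(void)
{
    void *p = malloc(100);
    if (!p)
        return;

    /* Shrinking always fits in the existing block. */
    assert(try_resize_in_place(p, 50));

    /* A 100-byte allocation cannot serve 1 GiB in place. */
    assert(!try_resize_in_place(p, (size_t)1 << 30));

    free(p);
}
```

A caller would fall back to an ordinary realloc when the function returns false. Whether a caller may actually use the extra bytes after a successful in-place "resize" is precisely the point under discussion in this thread, so the sketch only answers the question and does not touch the memory.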

I've got a write-up somewhere of what a replacement for jemalloc's
xallocx function would look like for class-free allocators.  I think it
has to involve arithmetic progressions (but no L-series), so it's not
pretty.

Anyway, if there is no real use case for malloc_usable_size with the
current glibc realloc implementation, we should really deprecate it.
Especially since C++ is moving in an incompatible direction.

Thanks,
Florian


end of thread, other threads:[~2022-10-10 14:13 UTC | newest]

Thread overview: 68+ messages
2022-09-19  7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang
2022-09-19 11:08 ` Szabolcs Nagy
2022-09-19 12:36   ` Florian Weimer
2022-09-19 13:46     ` Rich Felker
2022-09-19 13:53       ` James Y Knight
2022-09-19 17:40         ` baiyang
2022-09-19 18:14           ` Szabolcs Nagy
2022-09-19 18:40             ` baiyang
2022-09-19 19:07             ` Gabriel Ravier
2022-09-19 19:21               ` Rich Felker
2022-09-19 21:02                 ` Gabriel Ravier
2022-09-19 21:47                   ` Rich Felker
2022-09-19 22:31                     ` Gabriel Ravier
2022-09-19 22:46                       ` baiyang
2022-09-19 20:46             ` Nat!
2022-09-20  8:51               ` Szabolcs Nagy
2022-09-20  0:13           ` James Y Knight
2022-09-20  0:25             ` baiyang
2022-09-20  0:38               ` Rich Felker
2022-09-20  0:47                 ` baiyang
2022-09-20  1:00                   ` Rich Felker
2022-09-20  1:18                     ` baiyang
2022-09-20  2:15                       ` Rich Felker
2022-09-20  2:35                         ` baiyang
2022-09-20  3:28                           ` Rich Felker
2022-09-20  3:53                             ` baiyang
2022-09-20  5:41                               ` Rich Felker
2022-09-20  5:56                                 ` baiyang
2022-09-20 12:16                                   ` Rich Felker
2022-09-20 17:21                                     ` baiyang
2022-09-20  8:33       ` Florian Weimer
2022-09-20 13:54         ` Siddhesh Poyarekar
2022-09-20 16:59           ` James Y Knight
2022-09-20 17:34             ` Szabolcs Nagy
2022-09-20 19:53               ` James Y Knight
2022-09-24  8:55               ` Fangrui Song
2022-09-20 17:39             ` baiyang
2022-09-20 18:12               ` Quentin Rameau
2022-09-20 18:19                 ` Rich Felker
2022-09-20 18:26                   ` Alexander Monakov
2022-09-20 18:35                     ` baiyang
2022-09-20 20:33                       ` Gabriel Ravier
2022-09-20 20:45                         ` baiyang
2022-09-21  8:42                           ` NRK
2022-09-20 18:37                     ` Quentin Rameau
2022-09-21 10:15                   ` [musl] " 王志强
2022-09-21 16:11                     ` [musl] " 王志强
2022-09-21 17:15                     ` [musl] " Rich Felker
2022-09-21 17:58                       ` Rich Felker
2022-09-22  3:34                         ` [musl] " 王志强
2022-09-22  9:10                           ` [musl] " 王志强
2022-09-22  9:39                             ` [musl] " 王志强
2022-09-20 17:28           ` baiyang
2022-09-20 17:44             ` Siddhesh Poyarekar
2022-10-10 14:13           ` Florian Weimer
2022-09-19 13:43 ` Rich Felker
2022-09-19 17:32   ` baiyang
2022-09-19 18:15     ` Rich Felker
2022-09-19 18:44       ` baiyang
2022-09-19 19:18         ` Rich Felker
2022-09-19 19:45           ` baiyang
2022-09-19 20:07             ` Rich Felker
2022-09-19 20:17               ` baiyang
2022-09-19 20:28                 ` Rich Felker
2022-09-19 20:38                   ` baiyang
2022-09-19 22:02                 ` Quentin Rameau
2022-09-19 20:17             ` Joakim Sindholt
2022-09-19 20:33               ` baiyang
