* [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) @ 2022-09-19 7:53 baiyang 2022-09-19 11:08 ` Szabolcs Nagy 2022-09-19 13:43 ` Rich Felker 0 siblings, 2 replies; 68+ messages in thread From: baiyang @ 2022-09-19 7:53 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 759 bytes --] Hi there, As we have discussed at https://github.com/openwrt/openwrt/issues/10752. The malloc_usable_size() function in musl 1.2 (mallocng) seems to have some performance issues. It caused realloc and free spends too long time for get the chunk size. As we mentioned in the discussion, tcmalloc and some other allocators can also accurately obtain the size class corresponding to a memory block and its precise size, and it is also very fast at the same time. Can we make some improvements to the existing malloc_usable_size algorithm in mallocng? This should significantly improve the performance of existing algorithms. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** [-- Attachment #2: Type: text/html, Size: 2921 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang @ 2022-09-19 11:08 ` Szabolcs Nagy 2022-09-19 12:36 ` Florian Weimer 2022-09-19 13:43 ` Rich Felker 1 sibling, 1 reply; 68+ messages in thread From: Szabolcs Nagy @ 2022-09-19 11:08 UTC (permalink / raw) To: baiyang; +Cc: musl * baiyang <baiyang@gmail.com> [2022-09-19 15:53:30 +0800]: > Hi there, > > As we have discussed at https://github.com/openwrt/openwrt/issues/10752. The malloc_usable_size() function in musl 1.2 (mallocng) seems to have some performance issues. > > It caused realloc and free spends too long time for get the chunk size. how is malloc_usable_size related to free and realloc performance? > > As we mentioned in the discussion, tcmalloc and some other allocators can also accurately obtain the size class corresponding to a memory block and its precise size, and it is also very fast at the same time. unlike musl those implementations don't return exact size nor have the same security and memory fragmentation guarantees, so bad comparision. tcmalloc: // Returns the actual number N of bytes reserved by tcmalloc for the pointer // p. This number may be equal to or greater than the number of bytes // requested when p was allocated. // // This function is just useful for statistics collection. The client must // *not* read or write from the extra bytes that are indicated by this call. jemalloc: <para>The <function>malloc_usable_size()</function> function returns the usable size of the allocation pointed to by <parameter>ptr</parameter>. The return value may be larger than the size that was requested during allocation. The <function>malloc_usable_size()</function> function is not a mechanism for in-place <function>realloc()</function>; rather it is provided solely as a tool for introspection purposes. Any discrepancy between the requested allocation size and the size reported by <function>malloc_usable_size()</function> should not be depended on, since such behavior is entirely implementation-dependent. > > Can we make some improvements to the existing malloc_usable_size algorithm in mallocng? This should significantly improve the performance of existing algorithms. can you give an actual interesting workload where that's the bottleneck? given that calling this function is almost always a bug in production code, it's hard to see how it can cause performance problems. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 11:08 ` Szabolcs Nagy @ 2022-09-19 12:36 ` Florian Weimer 2022-09-19 13:46 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: Florian Weimer @ 2022-09-19 12:36 UTC (permalink / raw) To: baiyang; +Cc: musl * Szabolcs Nagy: > unlike musl those implementations don't return exact size nor have the > same security and memory fragmentation guarantees, so bad comparision. > > tcmalloc: > // Returns the actual number N of bytes reserved by tcmalloc for the pointer > // p. This number may be equal to or greater than the number of bytes > // requested when p was allocated. > // > // This function is just useful for statistics collection. The client must > // *not* read or write from the extra bytes that are indicated by this call. > > jemalloc: > <para>The <function>malloc_usable_size()</function> function > returns the usable size of the allocation pointed to by > <parameter>ptr</parameter>. The return value may be larger than the size > that was requested during allocation. The > <function>malloc_usable_size()</function> function is not a > mechanism for in-place <function>realloc()</function>; rather > it is provided solely as a tool for introspection purposes. Any > discrepancy between the requested allocation size and the size reported > by <function>malloc_usable_size()</function> should not be > depended on, since such behavior is entirely implementation-dependent. These implementations are buggy or at least mis-documented. The interface contract is clearly that for that particular object, the extra bytes in the allocation are available for reading and writing. It is not guaranteed that the allocator will always provide the same number of extra bytes for the same requested size, but they must be there for the allocation being examined. It's even in the name of the function! Thanks, Florian ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 12:36 ` Florian Weimer @ 2022-09-19 13:46 ` Rich Felker 2022-09-19 13:53 ` James Y Knight 2022-09-20 8:33 ` Florian Weimer 0 siblings, 2 replies; 68+ messages in thread From: Rich Felker @ 2022-09-19 13:46 UTC (permalink / raw) To: Florian Weimer; +Cc: baiyang, musl On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote: > * Szabolcs Nagy: > > > unlike musl those implementations don't return exact size nor have the > > same security and memory fragmentation guarantees, so bad comparision. > > > > tcmalloc: > > // Returns the actual number N of bytes reserved by tcmalloc for the pointer > > // p. This number may be equal to or greater than the number of bytes > > // requested when p was allocated. > > // > > // This function is just useful for statistics collection. The client must > > // *not* read or write from the extra bytes that are indicated by this call. > > > > jemalloc: > > <para>The <function>malloc_usable_size()</function> function > > returns the usable size of the allocation pointed to by > > <parameter>ptr</parameter>. The return value may be larger than the size > > that was requested during allocation. The > > <function>malloc_usable_size()</function> function is not a > > mechanism for in-place <function>realloc()</function>; rather > > it is provided solely as a tool for introspection purposes. Any > > discrepancy between the requested allocation size and the size reported > > by <function>malloc_usable_size()</function> should not be > > depended on, since such behavior is entirely implementation-dependent. > > These implementations are buggy or at least mis-documented. The > interface contract is clearly that for that particular object, the extra > bytes in the allocation are available for reading and writing. It is > not guaranteed that the allocator will always provide the same number of > extra bytes for the same requested size, but they must be there for the > allocation being examined. It's even in the name of the function! I'm not sure I understand what you're saying, but the core problem that really can't be solved is potential discrepancy between the malloc implementation's idea of usable and the compiler's. For example: char *p = malloc(1); if (malloc_usable_size(p)>1) p[1] = 42; will cause a compiler that's actively detecting UB to abort the program when malloc_usable_size returns a value larger than 1. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 13:46 ` Rich Felker @ 2022-09-19 13:53 ` James Y Knight 2022-09-19 17:40 ` baiyang 2022-09-20 8:33 ` Florian Weimer 1 sibling, 1 reply; 68+ messages in thread From: James Y Knight @ 2022-09-19 13:53 UTC (permalink / raw) To: musl; +Cc: Florian Weimer, baiyang [-- Attachment #1: Type: text/plain, Size: 3248 bytes --] Indeed. RedHat mentioned that problem in their recent post about _FORTIFY_SOURCE=3, here https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level """ _FORTIFY_SOURCE=3 revealed another pattern. Applications such as systemd used malloc_usable_size to determine available space in objects and then used the residual space. The glibc manual discourages this type of usage, dictating that malloc_usable_size is for diagnostic purposes only. But applications use the function as a hack to avoid reallocating buffers when there is space in the underlying malloc chunk. The implementation of malloc_usable_size needs to be fixed to return the allocated object size instead of the chunk size in non-diagnostic use. Alternatively, another solution is to deprecate the function. But that is a topic for discussion by the glibc community. """ On Mon, Sep 19, 2022 at 9:47 AM Rich Felker <dalias@libc.org> wrote: > On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote: > > * Szabolcs Nagy: > > > > > unlike musl those implementations don't return exact size nor have the > > > same security and memory fragmentation guarantees, so bad comparision. > > > > > > tcmalloc: > > > // Returns the actual number N of bytes reserved by tcmalloc for the > pointer > > > // p. This number may be equal to or greater than the number of > bytes > > > // requested when p was allocated. > > > // > > > // This function is just useful for statistics collection. The > client must > > > // *not* read or write from the extra bytes that are indicated by > this call. > > > > > > jemalloc: > > > <para>The <function>malloc_usable_size()</function> function > > > returns the usable size of the allocation pointed to by > > > <parameter>ptr</parameter>. The return value may be larger than > the size > > > that was requested during allocation. The > > > <function>malloc_usable_size()</function> function is not a > > > mechanism for in-place <function>realloc()</function>; rather > > > it is provided solely as a tool for introspection purposes. Any > > > discrepancy between the requested allocation size and the size > reported > > > by <function>malloc_usable_size()</function> should not be > > > depended on, since such behavior is entirely > implementation-dependent. > > > > These implementations are buggy or at least mis-documented. The > > interface contract is clearly that for that particular object, the extra > > bytes in the allocation are available for reading and writing. It is > > not guaranteed that the allocator will always provide the same number of > > extra bytes for the same requested size, but they must be there for the > > allocation being examined. It's even in the name of the function! > > I'm not sure I understand what you're saying, but the core problem > that really can't be solved is potential discrepancy between the > malloc implementation's idea of usable and the compiler's. For > example: > > char *p = malloc(1); > if (malloc_usable_size(p)>1) p[1] = 42; > > will cause a compiler that's actively detecting UB to abort the > program when malloc_usable_size returns a value larger than 1. > > Rich > [-- Attachment #2: Type: text/html, Size: 4150 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 13:53 ` James Y Knight @ 2022-09-19 17:40 ` baiyang 2022-09-19 18:14 ` Szabolcs Nagy 2022-09-20 0:13 ` James Y Knight 0 siblings, 2 replies; 68+ messages in thread From: baiyang @ 2022-09-19 17:40 UTC (permalink / raw) To: James Y Knight, musl; +Cc: Florian Weimer [-- Attachment #1: Type: text/plain, Size: 4264 bytes --] Hi James, I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: James Y Knight Date: 2022-09-19 21:53 To: musl CC: Florian Weimer; baiyang Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) Indeed. RedHat mentioned that problem in their recent post about _FORTIFY_SOURCE=3, here https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level """ _FORTIFY_SOURCE=3 revealed another pattern. Applications such as systemd used malloc_usable_size to determine available space in objects and then used the residual space. The glibc manual discourages this type of usage, dictating that malloc_usable_size is for diagnostic purposes only. But applications use the function as a hack to avoid reallocating buffers when there is space in the underlying malloc chunk. The implementation of malloc_usable_size needs to be fixed to return the allocated object size instead of the chunk size in non-diagnostic use. Alternatively, another solution is to deprecate the function. But that is a topic for discussion by the glibc community. """ On Mon, Sep 19, 2022 at 9:47 AM Rich Felker <dalias@libc.org> wrote: On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote: > * Szabolcs Nagy: > > > unlike musl those implementations don't return exact size nor have the > > same security and memory fragmentation guarantees, so bad comparision. > > > > tcmalloc: > > // Returns the actual number N of bytes reserved by tcmalloc for the pointer > > // p. This number may be equal to or greater than the number of bytes > > // requested when p was allocated. > > // > > // This function is just useful for statistics collection. The client must > > // *not* read or write from the extra bytes that are indicated by this call. > > > > jemalloc: > > <para>The <function>malloc_usable_size()</function> function > > returns the usable size of the allocation pointed to by > > <parameter>ptr</parameter>. The return value may be larger than the size > > that was requested during allocation. The > > <function>malloc_usable_size()</function> function is not a > > mechanism for in-place <function>realloc()</function>; rather > > it is provided solely as a tool for introspection purposes. Any > > discrepancy between the requested allocation size and the size reported > > by <function>malloc_usable_size()</function> should not be > > depended on, since such behavior is entirely implementation-dependent. > > These implementations are buggy or at least mis-documented. The > interface contract is clearly that for that particular object, the extra > bytes in the allocation are available for reading and writing. It is > not guaranteed that the allocator will always provide the same number of > extra bytes for the same requested size, but they must be there for the > allocation being examined. It's even in the name of the function! I'm not sure I understand what you're saying, but the core problem that really can't be solved is potential discrepancy between the malloc implementation's idea of usable and the compiler's. For example: char *p = malloc(1); if (malloc_usable_size(p)>1) p[1] = 42; will cause a compiler that's actively detecting UB to abort the program when malloc_usable_size returns a value larger than 1. Rich [-- Attachment #2: Type: text/html, Size: 8271 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 17:40 ` baiyang @ 2022-09-19 18:14 ` Szabolcs Nagy 2022-09-19 18:40 ` baiyang ` (2 more replies) 2022-09-20 0:13 ` James Y Knight 1 sibling, 3 replies; 68+ messages in thread From: Szabolcs Nagy @ 2022-09-19 18:14 UTC (permalink / raw) To: baiyang; +Cc: James Y Knight, musl, Florian Weimer * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: > I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). > > On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > > I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. obviously internally the implementation can use the internal chunk size... GetSize(p) is not the exact size (that the user allocated) but an internal size (which may be larger) and that must not be exposed *outside* of the malloc implementation (other than for diagnostic purposes). you can have 2 views: (1) tcmalloc and jemalloc are buggy because they expose an internal that must not be exposed (becaues it can break user code). (2) user code is buggy if it uses malloc_usable_size for any purpose other than diagnostic/statistics (because other uses are broken on many implementations). either way the brokenness you want to support is a security hazard and you are lucky that musl saves the day: it works hard not to expose internal sizes so the code you seem to care about can operate safely (which is not true on tcmalloc and jemalloc: the compiler may break that code). ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 18:14 ` Szabolcs Nagy @ 2022-09-19 18:40 ` baiyang 2022-09-19 19:07 ` Gabriel Ravier 2022-09-19 20:46 ` Nat! 2 siblings, 0 replies; 68+ messages in thread From: baiyang @ 2022-09-19 18:40 UTC (permalink / raw) To: Szabolcs Nagy; +Cc: James Y Knight, musl, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 2582 bytes --] > GetSize(p) is not the exact size (that the user allocated) but an internal size (which may be larger) Yes, as I mentioned in another email, we just need this "internal size". The value returned by malloc_usable_size() may be greater than the requested size of the allocation. Also, I don't think there is any ambiguity in the manual pages of each platform regarding this "internal size": The value returned by malloc_usable_size() may be greater than the requested size of the allocation -- that's exactly what we want. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Szabolcs Nagy Date: 2022-09-20 02:14 To: baiyang CC: James Y Knight; musl; Florian Weimer Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: > I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). > > On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > > I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. obviously internally the implementation can use the internal chunk size... GetSize(p) is not the exact size (that the user allocated) but an internal size (which may be larger) and that must not be exposed *outside* of the malloc implementation (other than for diagnostic purposes). you can have 2 views: (1) tcmalloc and jemalloc are buggy because they expose an internal that must not be exposed (becaues it can break user code). (2) user code is buggy if it uses malloc_usable_size for any purpose other than diagnostic/statistics (because other uses are broken on many implementations). either way the brokenness you want to support is a security hazard and you are lucky that musl saves the day: it works hard not to expose internal sizes so the code you seem to care about can operate safely (which is not true on tcmalloc and jemalloc: the compiler may break that code). [-- Attachment #2: Type: text/html, Size: 5428 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 18:14 ` Szabolcs Nagy 2022-09-19 18:40 ` baiyang @ 2022-09-19 19:07 ` Gabriel Ravier 2022-09-19 19:21 ` Rich Felker 2022-09-19 20:46 ` Nat! 2 siblings, 1 reply; 68+ messages in thread From: Gabriel Ravier @ 2022-09-19 19:07 UTC (permalink / raw) To: baiyang, James Y Knight, musl, Florian Weimer, dalias On 9/19/22 20:14, Szabolcs Nagy wrote: > * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: >> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). >> >> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 >> >> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. > obviously internally the implementation can use the internal chunk size... > > GetSize(p) is not the exact size (that the user allocated) but an internal > size (which may be larger) and that must not be exposed *outside* of the > malloc implementation (other than for diagnostic purposes). > > you can have 2 views: > > (1) tcmalloc and jemalloc are buggy because they expose an internal > that must not be exposed (becaues it can break user code). > > (2) user code is buggy if it uses malloc_usable_size for any purpose > other than diagnostic/statistics (because other uses are broken > on many implementations). > > either way the brokenness you want to support is a security hazard > and you are lucky that musl saves the day: it works hard not to > expose internal sizes so the code you seem to care about can operate > safely (which is not true on tcmalloc and jemalloc: the compiler > may break that code). While I would agree that using malloc_usable_size is generally not a great idea (it's at most acceptable as a small micro-optimization, but I would only ever expect it to be seen in very well-tested code in very hot loops, as it is indeed quite easily misused), it seems like a bit of a stretch to say that all of: - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) - systemd (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h , along with all files using MALLOC_SIZEOF_SAFE, i.e. src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, src/basic/memory-util.h, src/basic/recurse-dir.c, src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, src/shared/journal-importer.c, src/shared/varlink.c, src/test/test-alloc-util.c and src/test/test-compress.c) - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc , along with at least 20 other uses) - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) - quickjs (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) - redis (https://github.com/redis/redis/blob/unstable/src/networking.c, along with a few other uses elsewhere) along with so many more well-known projects that I've given up on listing them, are all buggy because of their usage of malloc_usable_size... ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 19:07 ` Gabriel Ravier @ 2022-09-19 19:21 ` Rich Felker 2022-09-19 21:02 ` Gabriel Ravier 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 19:21 UTC (permalink / raw) To: Gabriel Ravier; +Cc: baiyang, James Y Knight, musl, Florian Weimer On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote: > On 9/19/22 20:14, Szabolcs Nagy wrote: > >* baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: > >>I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). > >> > >>On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > >> > >>I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. > >obviously internally the implementation can use the internal chunk size... > > > >GetSize(p) is not the exact size (that the user allocated) but an internal > >size (which may be larger) and that must not be exposed *outside* of the > >malloc implementation (other than for diagnostic purposes). > > > >you can have 2 views: > > > >(1) tcmalloc and jemalloc are buggy because they expose an internal > > that must not be exposed (becaues it can break user code). > > > >(2) user code is buggy if it uses malloc_usable_size for any purpose > > other than diagnostic/statistics (because other uses are broken > > on many implementations). > > > >either way the brokenness you want to support is a security hazard > >and you are lucky that musl saves the day: it works hard not to > >expose internal sizes so the code you seem to care about can operate > >safely (which is not true on tcmalloc and jemalloc: the compiler > >may break that code). > > While I would agree that using malloc_usable_size is generally not a > great idea (it's at most acceptable as a small micro-optimization, > but I would only ever expect it to be seen in very well-tested code > in very hot loops, as it is indeed quite easily misused), it seems > like a bit of a stretch to say that all of: > > - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) > > - systemd > (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h > , along with all files using MALLOC_SIZEOF_SAFE, i.e. > src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, > src/basic/memory-util.h, src/basic/recurse-dir.c, > src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, > src/shared/journal-importer.c, src/shared/varlink.c, > src/test/test-alloc-util.c and src/test/test-compress.c) > > - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc > , along with at least 20 other uses) > > - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) > > - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) > > - quickjs > (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) > > - redis > (https://github.com/redis/redis/blob/unstable/src/networking.c, > along with a few other uses elsewhere) > > along with so many more well-known projects that I've given up on > listing them, are all buggy because of their usage of > malloc_usable_size... Depending on how you interpret the contract of malloc_usable_size (which was historically ambigious), either (1) or (2) above is *necessarily* true. It's not a matter of opinion just logical consequences of the choice you make. Moreover, it's not at all a stretch to say 7+ popular projects have gigantic UB they don't care to fix. The whole story of musl has been finding *hundreds* of such projects, and eventually getting lots of them fixed. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 19:21 ` Rich Felker @ 2022-09-19 21:02 ` Gabriel Ravier 2022-09-19 21:47 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: Gabriel Ravier @ 2022-09-19 21:02 UTC (permalink / raw) To: Rich Felker; +Cc: baiyang, James Y Knight, musl, Florian Weimer On 9/19/22 21:21, Rich Felker wrote: > On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote: >> On 9/19/22 20:14, Szabolcs Nagy wrote: >>> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: >>>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). >>>> >>>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 >>>> >>>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. >>> obviously internally the implementation can use the internal chunk size... >>> >>> GetSize(p) is not the exact size (that the user allocated) but an internal >>> size (which may be larger) and that must not be exposed *outside* of the >>> malloc implementation (other than for diagnostic purposes). >>> >>> you can have 2 views: >>> >>> (1) tcmalloc and jemalloc are buggy because they expose an internal >>> that must not be exposed (becaues it can break user code). >>> >>> (2) user code is buggy if it uses malloc_usable_size for any purpose >>> other than diagnostic/statistics (because other uses are broken >>> on many implementations). >>> >>> either way the brokenness you want to support is a security hazard >>> and you are lucky that musl saves the day: it works hard not to >>> expose internal sizes so the code you seem to care about can operate >>> safely (which is not true on tcmalloc and jemalloc: the compiler >>> may break that code). >> While I would agree that using malloc_usable_size is generally not a >> great idea (it's at most acceptable as a small micro-optimization, >> but I would only ever expect it to be seen in very well-tested code >> in very hot loops, as it is indeed quite easily misused), it seems >> like a bit of a stretch to say that all of: >> >> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) >> >> - systemd >> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h >> , along with all files using MALLOC_SIZEOF_SAFE, i.e. >> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, >> src/basic/memory-util.h, src/basic/recurse-dir.c, >> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, >> src/shared/journal-importer.c, src/shared/varlink.c, >> src/test/test-alloc-util.c and src/test/test-compress.c) >> >> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc >> , along with at least 20 other uses) >> >> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) >> >> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) >> >> - quickjs >> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) >> >> - redis >> (https://github.com/redis/redis/blob/unstable/src/networking.c, >> along with a few other uses elsewhere) >> >> along with so many more well-known projects that I've given up on >> listing them, are all buggy because of their usage of >> malloc_usable_size... > Depending on how you interpret the contract of malloc_usable_size > (which was historically ambigious), either (1) or (2) above is > *necessarily* true. It's not a matter of opinion just logical > consequences of the choice you make. Hmmmm... to me none of the two options you have given are true. Instead, I'd argue that, as seen by the C abstract machine, invocation of malloc_usable_size actually increases the size of a memory block, in place, as if by magic. This may seem stupid to some degree, but to me it is the only way one can make sense of the documentation for malloc_usable_size in glibc, as it has been present in glibc since malloc_usable_size was first added in 1996: This routine tells you how many bytes you can actually use in an allocated chunk, which may be more than you requested (although often not). You can use this many bytes without worrying about overwriting other allocated objects. Not a particularly great programming practice, but still sometimes useful. To me, this note directly indicates that the function is intended to work, within the context of the abstract machine, as if it somehow increased the size of the block at the given address. It could perhaps have been better named as something like malloc_try_extend_in_place w.r.t. interpretation of the function's behavior as relating to the abstract machine, but the name it was given probably made more sense at the time, as one may have assumed that the other name would have meant that it might do some more work that just returning the internal allocated size. This interpretation is also the one followed by pretty much every single program out there that uses malloc_usable_size, and the opposite interpretation (that malloc_usable_size is allowed to return a completely useless number that can't be used for anything useful) seems rather asinine to me Such effects on the C abstract machine may seem ludicrous, as though nothing in C should be able to do things like this, but when you search for other examples of extensions to C that do weird things with the abstract machine, you can find some quite quickly: - POSIX makes it so that storing into a function pointer by aliasing into it with a data pointer shall result in defined behavior - GNU attributes like __attribute__((may_alias)) make it so that access through pointers with types with that attribute are defined even if they try to alias normally incompatible object types - x86 intrinsics that access pointers are defined as having defined behavior even if they try to alias normally incompatible object types (the may_alias attribute above is used in GCC to implement them) - POSIX makes it so that calling signal and then making the program multi-threaded shall result in defined behavior - POSIX makes it so that calling fprintf with numbered argument conversion specifications (%n$), the apostrophe flag character (along with a few other things I'm too lazy to list) shall result in defined behavior - Many more... and yet all of these are quite often used and it is generally accepted that an implementation implementing such features should also make it so that the rules of the abstract machine, as implemented, are correctly affected and that the implementation does not suddenly make it so that using these result in UB where they claim that they implement the corresponding features. You might say that you accept the changes POSIX makes to the abstract machine, but not those made by malloc_usable_size, but I would say that to define that function is to implicitly accept the semantics of it as defined by the implementation you're trying to follow: otherwise, it seems to me as though you're implementing the function wrongly, intentionally violating the specification of the function as originally written. And yes, I would also argue that this does indeed make it so that _FORTIFY_SOURCE=3 as currently implemented is wrong (as shown in https://www.openwall.com/lists/musl/2022/09/19/23), instead of malloc_usable_size being a useless function Also, the previously cited e-mail ended up leading me to a glibc bugzilla issue, in which it looks as though glibc developers have explicitly stated they consider the value returned by malloc_usable_size to be validly usable as the object size corresponding to the pointer: https://sourceware.org/bugzilla/show_bug.cgi?id=24775#c6 > > Moreover, it's not at all a stretch to say 7+ popular projects have > gigantic UB they don't care to fix. The whole story of musl has been > finding *hundreds* of such projects, and eventually getting lots of > them fixed. > > Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 21:02 ` Gabriel Ravier @ 2022-09-19 21:47 ` Rich Felker 2022-09-19 22:31 ` Gabriel Ravier 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 21:47 UTC (permalink / raw) To: Gabriel Ravier; +Cc: baiyang, James Y Knight, musl, Florian Weimer On Mon, Sep 19, 2022 at 11:02:16PM +0200, Gabriel Ravier wrote: > > On 9/19/22 21:21, Rich Felker wrote: > >On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote: > >>On 9/19/22 20:14, Szabolcs Nagy wrote: > >>>* baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: > >>>>I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). > >>>> > >>>>On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > >>>> > >>>>I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. > >>>obviously internally the implementation can use the internal chunk size... > >>> > >>>GetSize(p) is not the exact size (that the user allocated) but an internal > >>>size (which may be larger) and that must not be exposed *outside* of the > >>>malloc implementation (other than for diagnostic purposes). > >>> > >>>you can have 2 views: > >>> > >>>(1) tcmalloc and jemalloc are buggy because they expose an internal > >>> that must not be exposed (becaues it can break user code). > >>> > >>>(2) user code is buggy if it uses malloc_usable_size for any purpose > >>> other than diagnostic/statistics (because other uses are broken > >>> on many implementations). > >>> > >>>either way the brokenness you want to support is a security hazard > >>>and you are lucky that musl saves the day: it works hard not to > >>>expose internal sizes so the code you seem to care about can operate > >>>safely (which is not true on tcmalloc and jemalloc: the compiler > >>>may break that code). > >>While I would agree that using malloc_usable_size is generally not a > >>great idea (it's at most acceptable as a small micro-optimization, > >>but I would only ever expect it to be seen in very well-tested code > >>in very hot loops, as it is indeed quite easily misused), it seems > >>like a bit of a stretch to say that all of: > >> > >>- sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) > >> > >>- systemd > >>(https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h > >>, along with all files using MALLOC_SIZEOF_SAFE, i.e. > >>src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, > >>src/basic/memory-util.h, src/basic/recurse-dir.c, > >>src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, > >>src/shared/journal-importer.c, src/shared/varlink.c, > >>src/test/test-alloc-util.c and src/test/test-compress.c) > >> > >>- rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc > >>, along with at least 20 other uses) > >> > >>- folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) > >> > >>- lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) > >> > >>- quickjs > >>(https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) > >> > >>- redis > >>(https://github.com/redis/redis/blob/unstable/src/networking.c, > >>along with a few other uses elsewhere) > >> > >>along with so many more well-known projects that I've given up on > >>listing them, are all buggy because of their usage of > >>malloc_usable_size... > >Depending on how you interpret the contract of malloc_usable_size > >(which was historically ambigious), either (1) or (2) above is > >*necessarily* true. It's not a matter of opinion just logical > >consequences of the choice you make. > > Hmmmm... to me none of the two options you have given are true. > Instead, I'd argue that, as seen by the C abstract machine, > invocation of malloc_usable_size actually increases the size of a > memory block, in place, as if by magic. This may seem stupid to some > degree, but to me it is the only way one can make sense of the > documentation for malloc_usable_size in glibc, as it has been > present in glibc since malloc_usable_size was first added in 1996: I don't think that's stupid; it's a very reasonable model. Unfortunately it's one that's very hard to make the compiler aware of, and if the compiler can't be aware of the possibility of the object changing size, then it loses the ability to protect you from memory errors without having this kind of "false positive". Note that you can't just reason that "malloc_usable_size could internally have called realloc" because the compiler can see that the caller is still using the "old pointer", which is invalid, *even if realloc returned a bitwise-identical pointer* (even just comparing to see if it's equal is UB). I don't know how to make any of this non-awful. Basically, malloc_usable_size ties your hands with lots of unexpected constraints. This function alone is almost solely responsible for the musl inclusion/exclusion criteria policy for nonstandard extensions being extremely cautious about adding dubious extensions even if they "look harmless" at first -- we made the mistake of including it long ago, and don't want to repeat that. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 21:47 ` Rich Felker @ 2022-09-19 22:31 ` Gabriel Ravier 2022-09-19 22:46 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Gabriel Ravier @ 2022-09-19 22:31 UTC (permalink / raw) To: Rich Felker; +Cc: baiyang, James Y Knight, musl, Florian Weimer On 9/19/22 23:47, Rich Felker wrote: > On Mon, Sep 19, 2022 at 11:02:16PM +0200, Gabriel Ravier wrote: >> On 9/19/22 21:21, Rich Felker wrote: >>> On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote: >>>> On 9/19/22 20:14, Szabolcs Nagy wrote: >>>>> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: >>>>>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). >>>>>> >>>>>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 >>>>>> >>>>>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. >>>>> obviously internally the implementation can use the internal chunk size... >>>>> >>>>> GetSize(p) is not the exact size (that the user allocated) but an internal >>>>> size (which may be larger) and that must not be exposed *outside* of the >>>>> malloc implementation (other than for diagnostic purposes). >>>>> >>>>> you can have 2 views: >>>>> >>>>> (1) tcmalloc and jemalloc are buggy because they expose an internal >>>>> that must not be exposed (becaues it can break user code). >>>>> >>>>> (2) user code is buggy if it uses malloc_usable_size for any purpose >>>>> other than diagnostic/statistics (because other uses are broken >>>>> on many implementations). >>>>> >>>>> either way the brokenness you want to support is a security hazard >>>>> and you are lucky that musl saves the day: it works hard not to >>>>> expose internal sizes so the code you seem to care about can operate >>>>> safely (which is not true on tcmalloc and jemalloc: the compiler >>>>> may break that code). >>>> While I would agree that using malloc_usable_size is generally not a >>>> great idea (it's at most acceptable as a small micro-optimization, >>>> but I would only ever expect it to be seen in very well-tested code >>>> in very hot loops, as it is indeed quite easily misused), it seems >>>> like a bit of a stretch to say that all of: >>>> >>>> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) >>>> >>>> - systemd >>>> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h >>>> , along with all files using MALLOC_SIZEOF_SAFE, i.e. >>>> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, >>>> src/basic/memory-util.h, src/basic/recurse-dir.c, >>>> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, >>>> src/shared/journal-importer.c, src/shared/varlink.c, >>>> src/test/test-alloc-util.c and src/test/test-compress.c) >>>> >>>> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc >>>> , along with at least 20 other uses) >>>> >>>> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) >>>> >>>> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) >>>> >>>> - quickjs >>>> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) >>>> >>>> - redis >>>> (https://github.com/redis/redis/blob/unstable/src/networking.c, >>>> along with a few other uses elsewhere) >>>> >>>> along with so many more well-known projects that I've given up on >>>> listing them, are all buggy because of their usage of >>>> malloc_usable_size... >>> Depending on how you interpret the contract of malloc_usable_size >>> (which was historically ambigious), either (1) or (2) above is >>> *necessarily* true. It's not a matter of opinion just logical >>> consequences of the choice you make. >> Hmmmm... to me none of the two options you have given are true. >> Instead, I'd argue that, as seen by the C abstract machine, >> invocation of malloc_usable_size actually increases the size of a >> memory block, in place, as if by magic. This may seem stupid to some >> degree, but to me it is the only way one can make sense of the >> documentation for malloc_usable_size in glibc, as it has been >> present in glibc since malloc_usable_size was first added in 1996: > I don't think that's stupid; it's a very reasonable model. > Unfortunately it's one that's very hard to make the compiler aware of, > and if the compiler can't be aware of the possibility of the object > changing size, then it loses the ability to protect you from memory > errors without having this kind of "false positive". > > Note that you can't just reason that "malloc_usable_size could > internally have called realloc" because the compiler can see that the > caller is still using the "old pointer", which is invalid, *even if > realloc returned a bitwise-identical pointer* (even just comparing to > see if it's equal is UB). > > I don't know how to make any of this non-awful. Basically, > malloc_usable_size ties your hands with lots of unexpected > constraints. This function alone is almost solely responsible for the > musl inclusion/exclusion criteria policy for nonstandard extensions > being extremely cautious about adding dubious extensions even if they > "look harmless" at first -- we made the mistake of including it long > ago, and don't want to repeat that. > > Rich Well, you've said earlier that your implementation keeps in memory the original requested amount of memory and just returns that, and that implementation seems perfectly fine: it might be going slightly against the spirit of the original function, but as you've pointed out, it's a nice way of avoiding the awfulness that malloc_usable_size results in w.r.t. the work the compiler has to do to have it work properly while not ruining object size instrumentation/optimizations (although I suppose it might lead to criticisms of it taking up valuable heap size/making programs slow by neutralising APIs used to optimize them, as seen in this thread...) My argumentation was entirely against stating that relying on malloc_usable_size to indicate the size of an object is UB: to me, libcs that can't return a valid result for it should not pretend they support it properly, and compilers that make optimizations that make it not work when it returns a higher size than the originally asked-for size should not pretend they support it properly either. GCC with _FORTIFY_SOURCE coupled with glibc seems to expose this issue, and they basically need to either: - have GCC properly take malloc_usable_size into account w.r.t. their __builtin_dynamic_object_size, which might just be a complete nightmare to implement... - have glibc warn upon its usage with _FORTIFY_SOURCE, which might end up with people complaining about _FORTIFY_SOURCE breaking their programs and disabling useful APIs... - start the long process of deprecating the function out of glibc entirely, which might end up with people complaining about glibc making their programs less fast by disabling APIs they use to do so... - or change their heap implementation so they can return the originally requested size, which might end up with people complaining about glibc making their programs take more heap space and slower by returning a "useless" value out of malloc_usable_size (w.r.t. speed optimisations)... Quite the potentially awful choice to make indeed... ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 22:31 ` Gabriel Ravier @ 2022-09-19 22:46 ` baiyang 0 siblings, 0 replies; 68+ messages in thread From: baiyang @ 2022-09-19 22:46 UTC (permalink / raw) To: Gabriel Ravier, Rich Felker; +Cc: James Y Knight, musl, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 8664 bytes --] > My argumentation was entirely against stating that relying on malloc_usable_size to indicate the size of an object is UB I totally agree with this point, in all known cases the function where the error occurs is __builtin_dynamic_object_size, not malloc_usable_size. As you referenced earlier in the bugzilla discussion (https://sourceware.org/bugzilla/show_bug.cgi?id=24775#c6): This is a bug in Msan. NOT in malloc_usable_size. So it seems to me an odd choice to add significant overhead to calls like `malloc_usable_size`, `realloc`, and `free` in order to "fix" this non-existent bug. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Gabriel Ravier Date: 2022-09-20 06:31 To: Rich Felker CC: baiyang; James Y Knight; musl; Florian Weimer Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On 9/19/22 23:47, Rich Felker wrote: > On Mon, Sep 19, 2022 at 11:02:16PM +0200, Gabriel Ravier wrote: >> On 9/19/22 21:21, Rich Felker wrote: >>> On Mon, Sep 19, 2022 at 09:07:57PM +0200, Gabriel Ravier wrote: >>>> On 9/19/22 20:14, Szabolcs Nagy wrote: >>>>> * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: >>>>>> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). >>>>>> >>>>>> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 >>>>>> >>>>>> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. >>>>> obviously internally the implementation can use the internal chunk size... >>>>> >>>>> GetSize(p) is not the exact size (that the user allocated) but an internal >>>>> size (which may be larger) and that must not be exposed *outside* of the >>>>> malloc implementation (other than for diagnostic purposes). >>>>> >>>>> you can have 2 views: >>>>> >>>>> (1) tcmalloc and jemalloc are buggy because they expose an internal >>>>> that must not be exposed (becaues it can break user code). >>>>> >>>>> (2) user code is buggy if it uses malloc_usable_size for any purpose >>>>> other than diagnostic/statistics (because other uses are broken >>>>> on many implementations). >>>>> >>>>> either way the brokenness you want to support is a security hazard >>>>> and you are lucky that musl saves the day: it works hard not to >>>>> expose internal sizes so the code you seem to care about can operate >>>>> safely (which is not true on tcmalloc and jemalloc: the compiler >>>>> may break that code). >>>> While I would agree that using malloc_usable_size is generally not a >>>> great idea (it's at most acceptable as a small micro-optimization, >>>> but I would only ever expect it to be seen in very well-tested code >>>> in very hot loops, as it is indeed quite easily misused), it seems >>>> like a bit of a stretch to say that all of: >>>> >>>> - sqlite3 (https://github.com/sqlite/sqlite/blob/master/src/mem1.c) >>>> >>>> - systemd >>>> (https://github.com/systemd/systemd/blob/main/src/basic/alloc-util.h >>>> , along with all files using MALLOC_SIZEOF_SAFE, i.e. >>>> src/basic/alloc-util.c, src/basic/compress.c, src/basic/fileio.c, >>>> src/basic/memory-util.h, src/basic/recurse-dir.c, >>>> src/basic/string-util.c, src/libsystemd/sd-netlink/netlink-socket.c, >>>> src/shared/journal-importer.c, src/shared/varlink.c, >>>> src/test/test-alloc-util.c and src/test/test-compress.c) >>>> >>>> - rocksdb (https://github.com/facebook/rocksdb/blob/main/table/block_based/filter_policy.cc >>>> , along with at least 20 other uses) >>>> >>>> - folly (https://github.com/facebook/folly/blob/main/folly/small_vector.h) >>>> >>>> - lzham_codec (https://github.com/richgel999/lzham_codec/blob/master/lzhamdecomp/lzham_mem.cpp) >>>> >>>> - quickjs >>>> (https://raw.githubusercontent.com/bellard/quickjs/master/quickjs.c) >>>> >>>> - redis >>>> (https://github.com/redis/redis/blob/unstable/src/networking.c, >>>> along with a few other uses elsewhere) >>>> >>>> along with so many more well-known projects that I've given up on >>>> listing them, are all buggy because of their usage of >>>> malloc_usable_size... >>> Depending on how you interpret the contract of malloc_usable_size >>> (which was historically ambigious), either (1) or (2) above is >>> *necessarily* true. It's not a matter of opinion just logical >>> consequences of the choice you make. >> Hmmmm... to me none of the two options you have given are true. >> Instead, I'd argue that, as seen by the C abstract machine, >> invocation of malloc_usable_size actually increases the size of a >> memory block, in place, as if by magic. This may seem stupid to some >> degree, but to me it is the only way one can make sense of the >> documentation for malloc_usable_size in glibc, as it has been >> present in glibc since malloc_usable_size was first added in 1996: > I don't think that's stupid; it's a very reasonable model. > Unfortunately it's one that's very hard to make the compiler aware of, > and if the compiler can't be aware of the possibility of the object > changing size, then it loses the ability to protect you from memory > errors without having this kind of "false positive". > > Note that you can't just reason that "malloc_usable_size could > internally have called realloc" because the compiler can see that the > caller is still using the "old pointer", which is invalid, *even if > realloc returned a bitwise-identical pointer* (even just comparing to > see if it's equal is UB). > > I don't know how to make any of this non-awful. Basically, > malloc_usable_size ties your hands with lots of unexpected > constraints. This function alone is almost solely responsible for the > musl inclusion/exclusion criteria policy for nonstandard extensions > being extremely cautious about adding dubious extensions even if they > "look harmless" at first -- we made the mistake of including it long > ago, and don't want to repeat that. > > Rich Well, you've said earlier that your implementation keeps in memory the original requested amount of memory and just returns that, and that implementation seems perfectly fine: it might be going slightly against the spirit of the original function, but as you've pointed out, it's a nice way of avoiding the awfulness that malloc_usable_size results in w.r.t. the work the compiler has to do to have it work properly while not ruining object size instrumentation/optimizations (although I suppose it might lead to criticisms of it taking up valuable heap size/making programs slow by neutralising APIs used to optimize them, as seen in this thread...) My argumentation was entirely against stating that relying on malloc_usable_size to indicate the size of an object is UB: to me, libcs that can't return a valid result for it should not pretend they support it properly, and compilers that make optimizations that make it not work when it returns a higher size than the originally asked-for size should not pretend they support it properly either. GCC with _FORTIFY_SOURCE coupled with glibc seems to expose this issue, and they basically need to either: - have GCC properly take malloc_usable_size into account w.r.t. their __builtin_dynamic_object_size, which might just be a complete nightmare to implement... - have glibc warn upon its usage with _FORTIFY_SOURCE, which might end up with people complaining about _FORTIFY_SOURCE breaking their programs and disabling useful APIs... - start the long process of deprecating the function out of glibc entirely, which might end up with people complaining about glibc making their programs less fast by disabling APIs they use to do so... - or change their heap implementation so they can return the originally requested size, which might end up with people complaining about glibc making their programs take more heap space and slower by returning a "useless" value out of malloc_usable_size (w.r.t. speed optimisations)... Quite the potentially awful choice to make indeed... [-- Attachment #2: Type: text/html, Size: 13839 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 18:14 ` Szabolcs Nagy 2022-09-19 18:40 ` baiyang 2022-09-19 19:07 ` Gabriel Ravier @ 2022-09-19 20:46 ` Nat! 2022-09-20 8:51 ` Szabolcs Nagy 2 siblings, 1 reply; 68+ messages in thread From: Nat! @ 2022-09-19 20:46 UTC (permalink / raw) To: musl On 19.09.22 20:14, Szabolcs Nagy wrote: > * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: >> I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). >> >> On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 >> >> I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. > obviously internally the implementation can use the internal chunk size... > > GetSize(p) is not the exact size (that the user allocated) but an internal > size (which may be larger) and that must not be exposed *outside* of the > malloc implementation (other than for diagnostic purposes). > > you can have 2 views: > > (1) tcmalloc and jemalloc are buggy because they expose an internal > that must not be exposed (becaues it can break user code). > > (2) user code is buggy if it uses malloc_usable_size for any purpose > other than diagnostic/statistics (because other uses are broken > on many implementations). > > either way the brokenness you want to support is a security hazard > and you are lucky that musl saves the day: it works hard not to > expose internal sizes so the code you seem to care about can operate > safely (which is not true on tcmalloc and jemalloc: the compiler > may break that code). > You can also have the third view, that malloc is allocating "at least" the amount of size requested (as it technically it is likely to do). That you can use "malloc_usable_size" to get the actually available size. That the code that is enforcing the semantics, that only the "at least" bytes should be accessed is in error, unless the error checking code modifies "malloc_usable_size" to only return the size as requested by the user. Surely not a popular opinion :D :D Ciao Nat! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:46 ` Nat! @ 2022-09-20 8:51 ` Szabolcs Nagy 0 siblings, 0 replies; 68+ messages in thread From: Szabolcs Nagy @ 2022-09-20 8:51 UTC (permalink / raw) To: Nat!; +Cc: musl * Nat! <nat@mulle-kybernetik.com> [2022-09-19 22:46:20 +0200]: > > On 19.09.22 20:14, Szabolcs Nagy wrote: > > * baiyang <baiyang@gmail.com> [2022-09-20 01:40:48 +0800]: > > > I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). > > > > > > On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > > > > > > I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. > > obviously internally the implementation can use the internal chunk size... > > > > GetSize(p) is not the exact size (that the user allocated) but an internal > > size (which may be larger) and that must not be exposed *outside* of the > > malloc implementation (other than for diagnostic purposes). > > > > you can have 2 views: > > > > (1) tcmalloc and jemalloc are buggy because they expose an internal > > that must not be exposed (becaues it can break user code). > > > > (2) user code is buggy if it uses malloc_usable_size for any purpose > > other than diagnostic/statistics (because other uses are broken > > on many implementations). > > > > either way the brokenness you want to support is a security hazard > > and you are lucky that musl saves the day: it works hard not to > > expose internal sizes so the code you seem to care about can operate > > safely (which is not true on tcmalloc and jemalloc: the compiler > > may break that code). > > > You can also have the third view, that malloc is allocating "at least" the > amount of size requested (as it technically it is likely to do). That you > can use "malloc_usable_size" to get the actually available size. That the > code that is enforcing the semantics, that only the "at least" bytes should > be accessed is in error, unless the error checking code modifies > "malloc_usable_size" to only return the size as requested by the user. > > Surely not a popular opinion :D :D that third option is (3) the compiler must not optimize based on the malloc interface contract. (just treat it like a normal extern call.) which currently only happens at -O0 or -fno-builtin-malloc. (although it seems those dont affect __builtin_dynamic_object_size) note: the compiler cannot know about all non-standard extensions and even if it did it cannot prove the absence of their call if a pointer escapes. this means lot of malloc related optimizations and diagnostics have to be turned off for everyone just in case there is a m_u_s call. while this is a valid option it's not on the libc to implement but on the compiler, which we cant control. even if compiler devs agree with (3) we still cannot expose users to security risk who use older compilers for some time. i think it's not likely gcc or clang would accept (3) by default. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 17:40 ` baiyang 2022-09-19 18:14 ` Szabolcs Nagy @ 2022-09-20 0:13 ` James Y Knight 2022-09-20 0:25 ` baiyang 1 sibling, 1 reply; 68+ messages in thread From: James Y Knight @ 2022-09-20 0:13 UTC (permalink / raw) To: baiyang; +Cc: musl, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 1703 bytes --] On Mon, Sep 19, 2022 at 1:40 PM baiyang <baiyang@gmail.com> wrote: > Hi James, > > I looked at the code of tcmalloc, but I didn't find any of the problems > you mentioned in the implementation of malloc_usable_size (see: > https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 > ). > > On the contrary, similar to musl, tcmalloc also directly uses the return > value of malloc_usable_size in its realloc implementation to determine > whether memory needs to be reallocated: > https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 > > I think this is enough to show that the return value of malloc_usable_size > in tcmalloc is accurate and reliable, otherwise its own realloc will > cause a segment fault. > It returns an accurate value, but it is undefined behavior to actually _use_ that memory. You need to first call realloc -- and when calling realloc, you (probably...) need to access the new memory via the return value of the realloc call, rather than the pointer you passed as input -- even if they happen to have the same integer representation. (Though there's certainly plenty of debate/discussion regarding the exact workings of, and advisability of, such 'pointer provenance" rules...). In any case, calling the function inside the malloc implementation to implement realloc is a different matter entirely from user code calling it. And, note that just because something works for you today doesn't mean it's correct or allowed. It just means that you've been able to get away with it. So far. Until one day it stops working. That's the really fun part of Undefined Behavior... [-- Attachment #2: Type: text/html, Size: 3338 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 0:13 ` James Y Knight @ 2022-09-20 0:25 ` baiyang 2022-09-20 0:38 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 0:25 UTC (permalink / raw) To: James Y Knight; +Cc: musl, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 2121 bytes --] As Gabriel said, it's not a UB. Contrarily, it is a bug in gcc's Msan. Just read more on this thread :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: James Y Knight Date: 2022-09-20 08:13 To: baiyang CC: musl; Florian Weimer Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Mon, Sep 19, 2022 at 1:40 PM baiyang <baiyang@gmail.com> wrote: Hi James, I looked at the code of tcmalloc, but I didn't find any of the problems you mentioned in the implementation of malloc_usable_size (see: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 ). On the contrary, similar to musl, tcmalloc also directly uses the return value of malloc_usable_size in its realloc implementation to determine whether memory needs to be reallocated: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1499 I think this is enough to show that the return value of malloc_usable_size in tcmalloc is accurate and reliable, otherwise its own realloc will cause a segment fault. It returns an accurate value, but it is undefined behavior to actually _use_ that memory. You need to first call realloc -- and when calling realloc, you (probably...) need to access the new memory via the return value of the realloc call, rather than the pointer you passed as input -- even if they happen to have the same integer representation. (Though there's certainly plenty of debate/discussion regarding the exact workings of, and advisability of, such 'pointer provenance" rules...). In any case, calling the function inside the malloc implementation to implement realloc is a different matter entirely from user code calling it. And, note that just because something works for you today doesn't mean it's correct or allowed. It just means that you've been able to get away with it. So far. Until one day it stops working. That's the really fun part of Undefined Behavior... [-- Attachment #2: Type: text/html, Size: 5449 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 0:25 ` baiyang @ 2022-09-20 0:38 ` Rich Felker 2022-09-20 0:47 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 0:38 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 08:25:52AM +0800, baiyang wrote: > As Gabriel said, it's not a UB. Contrarily, it is a bug in gcc's Msan. > Just read more on this thread :-) Would it be possible to limit use of the list to actually requesting help or making reports, rather than inciting debates about what is UB or what the consequences of UB might be? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 0:38 ` Rich Felker @ 2022-09-20 0:47 ` baiyang 2022-09-20 1:00 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 0:47 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 1159 bytes --] > Would it be possible to limit use of the list to actually requesting > help or making reports, rather than inciting debates about what is UB > or what the consequences of UB might be? You are right. The real question is: if we only need malloc_usable_size to return the size actually allocated internally (not the size requested by the user, **just as musl version 1.1 and all other libc implementations do**), is it possible to improve its time and space efficiency? Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 08:38 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 08:25:52AM +0800, baiyang wrote: > As Gabriel said, it's not a UB. Contrarily, it is a bug in gcc's Msan. > Just read more on this thread :-) Would it be possible to limit use of the list to actually requesting help or making reports, rather than inciting debates about what is UB or what the consequences of UB might be? [-- Attachment #2: Type: text/html, Size: 3184 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 0:47 ` baiyang @ 2022-09-20 1:00 ` Rich Felker 2022-09-20 1:18 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 1:00 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 08:47:07AM +0800, baiyang wrote: > > Would it be possible to limit use of the list to actually requesting > > help or making reports, rather than inciting debates about what is UB > > or what the consequences of UB might be? > > You are right. > > The real question is: if we only need malloc_usable_size to return > the size actually allocated internally (not the size requested by > the user, **just as musl version 1.1 and all other libc > implementations do**), is it possible to improve its time and space > efficiency? There is no hidden "size actually allocated internally". The size you get is the size you requested. Everything else is allocator data structures *outside of the object* that the caller has no entitlement to peek or poke at, and malloc_usable_size's return value reflects that. If you want to see what portion of the time is being spent on different parts of processing the metadata, you could sit down and actually run it under perf to get a profiling report/flame graph. I'm pretty sure you'll find that the final get_nominal_size step is a small portion of the time spent. get_meta is probably the majority of the time, some of it fundamental, and some of it hardening. But don't take my word for it. Measure. One thing I can tell you definitively though: if you did what the C language (which lacks malloc_usable_size) intended you to do, and kept track of the size of your own buffer, and just used that, you would spend 0% of the time you're spending on this. You would also save the entire "several hundred ms per 10 million calls" it's costing on other malloc implementations, by just *not doing something you don't need to do*. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 1:00 ` Rich Felker @ 2022-09-20 1:18 ` baiyang 2022-09-20 2:15 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 1:18 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 2883 bytes --] > There is no hidden "size actually allocated internally". The size you > get is the size you requested. Everything else is allocator data > structures *outside of the object* that the caller has no entitlement > to peek or poke at, and malloc_usable_size's return value reflects > that. If I understand correctly, according to the definition of size_classes in the mallocng code: 1. When I call `void* p = malloc(6600)`, mallocng actually allocates more than 8100 bytes of usable space, right? 2. According to your previous explanation, calling malloc_usable_size(p) at this time returns 6600, right? My question is, if malloc_usable_size(p) can directly return 8191 (or similar actual allocated size, as other libc do) instead of 6600, is it possible to make mallocng achieve higher performance both in time and space? -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 09:00 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 08:47:07AM +0800, baiyang wrote: > > Would it be possible to limit use of the list to actually requesting > > help or making reports, rather than inciting debates about what is UB > > or what the consequences of UB might be? > > You are right. > > The real question is: if we only need malloc_usable_size to return > the size actually allocated internally (not the size requested by > the user, **just as musl version 1.1 and all other libc > implementations do**), is it possible to improve its time and space > efficiency? There is no hidden "size actually allocated internally". The size you get is the size you requested. Everything else is allocator data structures *outside of the object* that the caller has no entitlement to peek or poke at, and malloc_usable_size's return value reflects that. If you want to see what portion of the time is being spent on different parts of processing the metadata, you could sit down and actually run it under perf to get a profiling report/flame graph. I'm pretty sure you'll find that the final get_nominal_size step is a small portion of the time spent. get_meta is probably the majority of the time, some of it fundamental, and some of it hardening. But don't take my word for it. Measure. One thing I can tell you definitively though: if you did what the C language (which lacks malloc_usable_size) intended you to do, and kept track of the size of your own buffer, and just used that, you would spend 0% of the time you're spending on this. You would also save the entire "several hundred ms per 10 million calls" it's costing on other malloc implementations, by just *not doing something you don't need to do*. Rich [-- Attachment #2: Type: text/html, Size: 6578 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 1:18 ` baiyang @ 2022-09-20 2:15 ` Rich Felker 2022-09-20 2:35 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 2:15 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 09:18:04AM +0800, baiyang wrote: > > There is no hidden "size actually allocated internally". The size you > > get is the size you requested. Everything else is allocator data > > structures *outside of the object* that the caller has no entitlement > > to peek or poke at, and malloc_usable_size's return value reflects > > that. > > If I understand correctly, according to the definition of size_classes in the mallocng code: > 1. When I call `void* p = malloc(6600)`, mallocng actually allocates > more than 8100 bytes of usable space, right? No, it uses space from a size-class-8176 group (~=slab) to produce an allocation of size 6600. The *allocation* is the part that belongs to the caller. Everything else is part of the allocator data structures. > 2. According to your previous explanation, calling > malloc_usable_size(p) at this time returns 6600, right? Yes. > My question is, if malloc_usable_size(p) can directly return 8191 > (or similar actual allocated size, as other libc do) instead of > 6600, is it possible to make mallocng achieve higher performance > both in time and space? No, and the reason you said you want it to does not make sense. You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case. If realloc has to allocate a new object, the amount copied will be 6600 or exactly whatever the allocated object size was (or the new size, if smaller). This is the only meaningful number. You also seem to be under the impression that the work to determine that the size was 6600 and not 8100 is where most (or at least a significant portion of) the time is spent. This is also not the case. The majority of the metadata processing time is chasing pointers back to the out-of-band metadata, validating it, validating that it round-trips back, and validating various other things. Some of these could in principle be omitted at the cost of loss-of-hardening. Figuring out that the allocation is 6600 bytes, once you already know the size class and out-of-band metadata, is quite trivial and hardly takes any of the time. (It also has a few validation checks that could be omitted at the cost of loss of hardening, but these are proportionally much smaller.) Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 2:15 ` Rich Felker @ 2022-09-20 2:35 ` baiyang 2022-09-20 3:28 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 2:35 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 4299 bytes --] > You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case. Yes, I already understood that mallocng would only memcpy 6600 bytes when I was told that malloc_usable_size will return the size requested by the user. But AFAIK, many other malloc implementations basically don't keep 6600 bytes of data. So they're actually going to memcpy the 8100 bytes. > You also seem to be under the impression that the work to determine > that the size was 6600 and not 8100 is where most (or at least a > significant portion of) the time is spent. This is also not the case. > The majority of the metadata processing time is chasing pointers back > to the out-of-band metadata, validating it, validating that it > round-trips back, and validating various other things. Some of these > could in principle be omitted at the cost of loss-of-hardening. Yes, according to my previous understanding (which seems wrong now), since other malloc_usable_size implementations that directly return 8100 (the actual allocated size class length) such as tcmalloc are all very fast, so I can only understand that mallocng is so much slower than them because it has to return 6600, not 8100. Apart from this difference, there is no reason it is slower than other implementations of malloc_usable_size as I understand it. If this is not the main reason, can we speed up this algorithm with the help of a fast lookup table mechanism like tcmalloc? As I said before, this not only greatly increases the performance of malloc_usable_size , but also the performance of realloc and free . Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 10:15 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 09:18:04AM +0800, baiyang wrote: > > There is no hidden "size actually allocated internally". The size you > > get is the size you requested. Everything else is allocator data > > structures *outside of the object* that the caller has no entitlement > > to peek or poke at, and malloc_usable_size's return value reflects > > that. > > If I understand correctly, according to the definition of size_classes in the mallocng code: > 1. When I call `void* p = malloc(6600)`, mallocng actually allocates > more than 8100 bytes of usable space, right? No, it uses space from a size-class-8176 group (~=slab) to produce an allocation of size 6600. The *allocation* is the part that belongs to the caller. Everything else is part of the allocator data structures. > 2. According to your previous explanation, calling > malloc_usable_size(p) at this time returns 6600, right? Yes. > My question is, if malloc_usable_size(p) can directly return 8191 > (or similar actual allocated size, as other libc do) instead of > 6600, is it possible to make mallocng achieve higher performance > both in time and space? No, and the reason you said you want it to does not make sense. You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case. If realloc has to allocate a new object, the amount copied will be 6600 or exactly whatever the allocated object size was (or the new size, if smaller). This is the only meaningful number. You also seem to be under the impression that the work to determine that the size was 6600 and not 8100 is where most (or at least a significant portion of) the time is spent. This is also not the case. The majority of the metadata processing time is chasing pointers back to the out-of-band metadata, validating it, validating that it round-trips back, and validating various other things. Some of these could in principle be omitted at the cost of loss-of-hardening. Figuring out that the allocation is 6600 bytes, once you already know the size class and out-of-band metadata, is quite trivial and hardly takes any of the time. (It also has a few validation checks that could be omitted at the cost of loss of hardening, but these are proportionally much smaller.) Rich [-- Attachment #2: Type: text/html, Size: 7554 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 2:35 ` baiyang @ 2022-09-20 3:28 ` Rich Felker 2022-09-20 3:53 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 3:28 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 10:35:02AM +0800, baiyang wrote: > > You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case. > > Yes, I already understood that mallocng would only memcpy 6600 bytes > when I was told that malloc_usable_size will return the size > requested by the user. > > But AFAIK, many other malloc implementations basically don't keep > 6600 bytes of data. So they're actually going to memcpy the 8100 > bytes. You have made up a problem that does not exist. Specifically, at least as far as I can tell, no such implementation exists. The ones that return some value larger than the requested size are returning "the requested size, rounded up to a multiple of 16" or similar. Not "the requested size plus 1500 bytes". For the dlmalloc-style allocators, this is because they do not have uniform slots for size classes; they have arbitrary splits, and only care about aligning start/end on a properly aligned boundary. And for more slab-like allocators, they all keep track of at least a coarse "actual size" specifically so that realloc does not gratuitously copy large amounts of unnecessary data (among other reasons). > > You also seem to be under the impression that the work to determine > > that the size was 6600 and not 8100 is where most (or at least a > > significant portion of) the time is spent. This is also not the case. > > The majority of the metadata processing time is chasing pointers back > > to the out-of-band metadata, validating it, validating that it > > round-trips back, and validating various other things. Some of these > > could in principle be omitted at the cost of loss-of-hardening. > > Yes, according to my previous understanding (which seems wrong now), > since other malloc_usable_size implementations that directly return > 8100 (the actual allocated size class length) They don't return 8100. They return something like 6608 or 6624. > such as tcmalloc are > all very fast, so I can only understand that mallocng is so much > slower than them because it has to return 6600, not 8100. This does not follow at all. tcmalloc is fast because it does not have global consistency, does not have any notable hardening, and (see the name) keeps large numbers of freed slots *cached* to reuse, thereby using lots of extra memory. Its malloc_usable_size is not fast because of returning the wrong value, if it even does return the wrong value (I have no idea). It's fast because they store the size in-band right next to the allocated memory and trust that it's valid, rather than computing it from out-of-band metadata that is not subject to falsification unless the attacker already has nearly full control of execution. > Apart from > this difference, there is no reason it is slower than other > implementations of malloc_usable_size as I understand it. Comparing the speed of malloc_usable_size is utterly pointless. That is not where the time is spent in any real-world load that's not gratuitously doing stupid things. I imagine tcmalloc's is faster, for the reasons explained above (which have nothing to do with whether the value returned is the exact size you requested). But this has nothing to do with why the overall performance is higher. > If this is not the main reason, can we speed up this algorithm with > the help of a fast lookup table mechanism like tcmalloc? As I said > before, this not only greatly increases the performance of > malloc_usable_size , but also the performance of realloc and free . No, because the fundamental difference is where you're storing the data, and the *whole point* of mallocng is that we're storing it in a place where it's not subject to attack via overflows from other objects, UAF/DF, etc. This makes it somewhat costlier (but still very reasonable) to obtain. If you have a particular application where you actually want to trade safety and low memory usage for performance, you're perfectly free to link that application to whatever malloc implementation you like. musl has supported doing that since 1.1.20. It makes no sense to do this system-wide. You would just increase memory usage and attack surface drastically with nothing to show for it, since the vast majority of programs *don't care* how long malloc takes. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 3:28 ` Rich Felker @ 2022-09-20 3:53 ` baiyang 2022-09-20 5:41 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 3:53 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 6951 bytes --] > The ones that return some value larger than the requested size are > returning "the requested size, rounded up to a multiple of 16" or > similar. Not "the requested size plus 1500 bytes". ... > They don't return 8100. They return something like 6608 or 6624. No, AFAIK, There are many allocators whose return value of malloc_usable_size is 1KB (or more) larger than the requested value at malloc time. For Example: if you do "void* p = malloc(6700)" on tcmalloc, then "malloc_usable_size(p)" will return **8192**. Far more than just "rounded up to a multiple of 16". > This does not follow at all. tcmalloc is fast because it does not have > global consistency, does not have any notable hardening, and (see the > name) keeps large numbers of freed slots *cached* to reuse, thereby > using lots of extra memory. Its malloc_usable_size is not fast because > of returning the wrong value, if it even does return the wrong value > (I have no idea). We don't need to refer to these features of tcmalloc, we only need to refer to its malloc_usable_size algorithm. > It's fast because they store the size in-band right > next to the allocated memory and trust that it's valid, rather than > computing it from out-of-band metadata that is not subject to > falsification unless the attacker already has nearly full control of > execution. No, if I understand correctly, tcmalloce doesn't store the size in-band right next to the allocated memory. On the contrary, when executing malloc_usable_size(p) (actually GetSize(p)), it will first find the size class corresponding to p through a quick lookup table, and then return the length of the size class. See: https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 My understanding: the biggest impediment to our inability to apply similar optimizations is that we have to return 6700, not 8192 (of course, you've denied this is the reason). On the other hand, if the low speed is not caused by having to return 6700, then we should be able to use a similar quick lookup table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve at least dozens of times performance improvement. Thanks :-P -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 11:28 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 10:35:02AM +0800, baiyang wrote: > > You seem to think that if the group stride was 8100, calling realloc might memcpy up to 8100 bytes. This is not the case. > > Yes, I already understood that mallocng would only memcpy 6600 bytes > when I was told that malloc_usable_size will return the size > requested by the user. > > But AFAIK, many other malloc implementations basically don't keep > 6600 bytes of data. So they're actually going to memcpy the 8100 > bytes. You have made up a problem that does not exist. Specifically, at least as far as I can tell, no such implementation exists. The ones that return some value larger than the requested size are returning "the requested size, rounded up to a multiple of 16" or similar. Not "the requested size plus 1500 bytes". For the dlmalloc-style allocators, this is because they do not have uniform slots for size classes; they have arbitrary splits, and only care about aligning start/end on a properly aligned boundary. And for more slab-like allocators, they all keep track of at least a coarse "actual size" specifically so that realloc does not gratuitously copy large amounts of unnecessary data (among other reasons). > > You also seem to be under the impression that the work to determine > > that the size was 6600 and not 8100 is where most (or at least a > > significant portion of) the time is spent. This is also not the case. > > The majority of the metadata processing time is chasing pointers back > > to the out-of-band metadata, validating it, validating that it > > round-trips back, and validating various other things. Some of these > > could in principle be omitted at the cost of loss-of-hardening. > > Yes, according to my previous understanding (which seems wrong now), > since other malloc_usable_size implementations that directly return > 8100 (the actual allocated size class length) They don't return 8100. They return something like 6608 or 6624. > such as tcmalloc are > all very fast, so I can only understand that mallocng is so much > slower than them because it has to return 6600, not 8100. This does not follow at all. tcmalloc is fast because it does not have global consistency, does not have any notable hardening, and (see the name) keeps large numbers of freed slots *cached* to reuse, thereby using lots of extra memory. Its malloc_usable_size is not fast because of returning the wrong value, if it even does return the wrong value (I have no idea). It's fast because they store the size in-band right next to the allocated memory and trust that it's valid, rather than computing it from out-of-band metadata that is not subject to falsification unless the attacker already has nearly full control of execution. > Apart from > this difference, there is no reason it is slower than other > implementations of malloc_usable_size as I understand it. Comparing the speed of malloc_usable_size is utterly pointless. That is not where the time is spent in any real-world load that's not gratuitously doing stupid things. I imagine tcmalloc's is faster, for the reasons explained above (which have nothing to do with whether the value returned is the exact size you requested). But this has nothing to do with why the overall performance is higher. > If this is not the main reason, can we speed up this algorithm with > the help of a fast lookup table mechanism like tcmalloc? As I said > before, this not only greatly increases the performance of > malloc_usable_size , but also the performance of realloc and free . No, because the fundamental difference is where you're storing the data, and the *whole point* of mallocng is that we're storing it in a place where it's not subject to attack via overflows from other objects, UAF/DF, etc. This makes it somewhat costlier (but still very reasonable) to obtain. If you have a particular application where you actually want to trade safety and low memory usage for performance, you're perfectly free to link that application to whatever malloc implementation you like. musl has supported doing that since 1.1.20. It makes no sense to do this system-wide. You would just increase memory usage and attack surface drastically with nothing to show for it, since the vast majority of programs *don't care* how long malloc takes. Rich [-- Attachment #2: Type: text/html, Size: 13675 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 3:53 ` baiyang @ 2022-09-20 5:41 ` Rich Felker 2022-09-20 5:56 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 5:41 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote: > > The ones that return some value larger than the requested size are > > returning "the requested size, rounded up to a multiple of 16" or > > similar. Not "the requested size plus 1500 bytes". > ... > > They don't return 8100. They return something like 6608 or 6624. > > No, AFAIK, There are many allocators whose return value of > malloc_usable_size is 1KB (or more) larger than the requested value > at malloc time. > For Example: if you do "void* p = malloc(6700)" on tcmalloc, then > "malloc_usable_size(p)" will return **8192**. Far more than just > "rounded up to a multiple of 16". OK, thanks for checking and correcting. > > This does not follow at all. tcmalloc is fast because it does not have > > global consistency, does not have any notable hardening, and (see the > > name) keeps large numbers of freed slots *cached* to reuse, thereby > > using lots of extra memory. Its malloc_usable_size is not fast because > > of returning the wrong value, if it even does return the wrong value > > (I have no idea). > > We don't need to refer to these features of tcmalloc, we only need > to refer to its malloc_usable_size algorithm. Those (mis)features are what provide a fast path here, regardless of whether you care about them. > > It's fast because they store the size in-band right > > next to the allocated memory and trust that it's valid, rather than > > computing it from out-of-band metadata that is not subject to > > falsification unless the attacker already has nearly full control of > > execution. > > No, if I understand correctly, tcmalloce doesn't store the size > in-band right next to the allocated memory. On the contrary, when > executing malloc_usable_size(p) (actually GetSize(p)), it will first > find the size class corresponding to p through a quick lookup table, > and then return the length of the size class. See: > https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 OK, I was confusing tcmalloc with the more conventional "thread-local freelist caching on top of dlmalloc type base" allocator strategy. Indeed tcmalloc however is one of the gigantic ones. > My understanding: the biggest impediment to our inability to apply > similar optimizations is that we have to return 6700, not 8192 (of > course, you've denied this is the reason). Your understanding is wrong. I've told you how you can measure that it's wrong. You insist on being stuck on it for no good reason. If you want to understand *why* tcmalloc is different, start with the comments at the top of the file you linked: > // 4. The pagemap (which maps from page-number to descriptor), > // can be read without holding any locks, and written while holding > // the "pageheap_lock". > // > // This multi-threaded access to the pagemap is safe for fairly > // subtle reasons. We basically assume that when an object X is > // allocated by thread A and deallocated by thread B, there must > // have been appropriate synchronization in the handoff of object > // X from thread A to thread B. This is the kind of thing I mean by lack of global consistency (no synchronization around access to these data structures) and lack of any meaningful hardening (*assuming* no memory lifetime usage errors in the calling application). The GetSize function you cited uses this global pagemap to go straight from a page address to a sizeclass, via what amounts to a two-level or three-level table indexed by upper bits of the address (comment says 3-level is only used in slower but lower-mem-use configuration). These tables, at least in the 2-level form, are utterly *massive*. I'm not sure if it creates them PROT_NONE and then only instantiates real memory for the (initially fairly sparse) parts that get used, or if it just allocates these giant things relying on overcommit, but either way that's not compatible with small-memory-space systems or with nommu. On top of that, this approach relies on laying out whole pages (likely large slabs of many pages at a time) of identical-sized objects so that the size and other properties can be looked up by page number. I have not looked into the details of "how bad" it gets, but it completely precludes having any small processes, and precludes promptly returning freed memory to the system, since *changing* the pagemap is going to be costly and they're going to avoid doing it (note the above comment on locking). mallocng does not have any global mapping optimizing translation from addresses to groups/metadata objects. Because we insist on global consistency (a prerequisite for being able to define strong hardening properties) and on being able to return freed memory promptly to the system, maintaining such a data structure would cost a lot more time (performance) than anything it could give, and it would make lock-free operations (like your malloc_usable_size, or trivial realloc calls) potentially require locking. Instead of using the numeric value of the address to map to metadata, we chase offsets from the object base address to the metadata, then validate that it round-trips back to conclude that we didn't just follow random junk from the caller passing an invalid/dangling pointer in, or from this data being overwritten via heap-based buffer overflows. Fundamentally, this pointer chasing is going to be a little bit more expensive than just using address bits as table indices, but not really all that much. At least half of the cost difference, and probably a lot more, is not the pointer/offset chasing but the validation (hardening). If hypothetically you wanted to turn that all off (e.g. by defining the assert macro to a no-op) you could have it be a lot faster, and still have low memory usage too. I'm not sure but for single-threaded loads I would not be surprised if it were getting close to tcmalloc speed. Just casually building with assert() defined to nop out the tests, I got double performance on your TEST2. Of course, I don't recommend doing this. But it's an interesting test for performing *measurement* (which you so far refuse to do) of what's actually making the performance differences. > On the other hand, if the low speed is not caused by having to > return 6700, then we should be able to use a similar quick lookup > table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve > at least dozens of times performance improvement. Once again, the big difference is not the "6700". The tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines 6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of the work here is in lines 6-11, mainly lines 7 and 10. I did a similar casual test removing line 12 and just returning something based on the earlier computations, and it made something like a 30% reduction in test run time (with or without the hardening asserts nopped out). Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 5:41 ` Rich Felker @ 2022-09-20 5:56 ` baiyang 2022-09-20 12:16 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 5:56 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 8157 bytes --] > // This multi-threaded access to the pagemap is safe for fairly > // subtle reasons. We basically assume that when an object X is > // allocated by thread A and deallocated by thread B, there must > // have been appropriate synchronization in the handoff of object > // X from thread A to thread B. Thanks for your information. I feel this assumption is very reasonable: you can't have one thread doing "free(p)" while another thread is accessing the block pointed to by p without any synchronization mechanism at the same time. > but either way that's not compatible with small-memory-space systems or with nommu. OK, that's reasonable. And again, thanks for your patience and time :-D -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 13:41 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote: > > The ones that return some value larger than the requested size are > > returning "the requested size, rounded up to a multiple of 16" or > > similar. Not "the requested size plus 1500 bytes". > ... > > They don't return 8100. They return something like 6608 or 6624. > > No, AFAIK, There are many allocators whose return value of > malloc_usable_size is 1KB (or more) larger than the requested value > at malloc time. > For Example: if you do "void* p = malloc(6700)" on tcmalloc, then > "malloc_usable_size(p)" will return **8192**. Far more than just > "rounded up to a multiple of 16". OK, thanks for checking and correcting. > > This does not follow at all. tcmalloc is fast because it does not have > > global consistency, does not have any notable hardening, and (see the > > name) keeps large numbers of freed slots *cached* to reuse, thereby > > using lots of extra memory. Its malloc_usable_size is not fast because > > of returning the wrong value, if it even does return the wrong value > > (I have no idea). > > We don't need to refer to these features of tcmalloc, we only need > to refer to its malloc_usable_size algorithm. Those (mis)features are what provide a fast path here, regardless of whether you care about them. > > It's fast because they store the size in-band right > > next to the allocated memory and trust that it's valid, rather than > > computing it from out-of-band metadata that is not subject to > > falsification unless the attacker already has nearly full control of > > execution. > > No, if I understand correctly, tcmalloce doesn't store the size > in-band right next to the allocated memory. On the contrary, when > executing malloc_usable_size(p) (actually GetSize(p)), it will first > find the size class corresponding to p through a quick lookup table, > and then return the length of the size class. See: > https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 OK, I was confusing tcmalloc with the more conventional "thread-local freelist caching on top of dlmalloc type base" allocator strategy. Indeed tcmalloc however is one of the gigantic ones. > My understanding: the biggest impediment to our inability to apply > similar optimizations is that we have to return 6700, not 8192 (of > course, you've denied this is the reason). Your understanding is wrong. I've told you how you can measure that it's wrong. You insist on being stuck on it for no good reason. If you want to understand *why* tcmalloc is different, start with the comments at the top of the file you linked: > // 4. The pagemap (which maps from page-number to descriptor), > // can be read without holding any locks, and written while holding > // the "pageheap_lock". > // > // This multi-threaded access to the pagemap is safe for fairly > // subtle reasons. We basically assume that when an object X is > // allocated by thread A and deallocated by thread B, there must > // have been appropriate synchronization in the handoff of object > // X from thread A to thread B. This is the kind of thing I mean by lack of global consistency (no synchronization around access to these data structures) and lack of any meaningful hardening (*assuming* no memory lifetime usage errors in the calling application). The GetSize function you cited uses this global pagemap to go straight from a page address to a sizeclass, via what amounts to a two-level or three-level table indexed by upper bits of the address (comment says 3-level is only used in slower but lower-mem-use configuration). These tables, at least in the 2-level form, are utterly *massive*. I'm not sure if it creates them PROT_NONE and then only instantiates real memory for the (initially fairly sparse) parts that get used, or if it just allocates these giant things relying on overcommit, but either way that's not compatible with small-memory-space systems or with nommu. On top of that, this approach relies on laying out whole pages (likely large slabs of many pages at a time) of identical-sized objects so that the size and other properties can be looked up by page number. I have not looked into the details of "how bad" it gets, but it completely precludes having any small processes, and precludes promptly returning freed memory to the system, since *changing* the pagemap is going to be costly and they're going to avoid doing it (note the above comment on locking). mallocng does not have any global mapping optimizing translation from addresses to groups/metadata objects. Because we insist on global consistency (a prerequisite for being able to define strong hardening properties) and on being able to return freed memory promptly to the system, maintaining such a data structure would cost a lot more time (performance) than anything it could give, and it would make lock-free operations (like your malloc_usable_size, or trivial realloc calls) potentially require locking. Instead of using the numeric value of the address to map to metadata, we chase offsets from the object base address to the metadata, then validate that it round-trips back to conclude that we didn't just follow random junk from the caller passing an invalid/dangling pointer in, or from this data being overwritten via heap-based buffer overflows. Fundamentally, this pointer chasing is going to be a little bit more expensive than just using address bits as table indices, but not really all that much. At least half of the cost difference, and probably a lot more, is not the pointer/offset chasing but the validation (hardening). If hypothetically you wanted to turn that all off (e.g. by defining the assert macro to a no-op) you could have it be a lot faster, and still have low memory usage too. I'm not sure but for single-threaded loads I would not be surprised if it were getting close to tcmalloc speed. Just casually building with assert() defined to nop out the tests, I got double performance on your TEST2. Of course, I don't recommend doing this. But it's an interesting test for performing *measurement* (which you so far refuse to do) of what's actually making the performance differences. > On the other hand, if the low speed is not caused by having to > return 6700, then we should be able to use a similar quick lookup > table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve > at least dozens of times performance improvement. Once again, the big difference is not the "6700". The tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines 6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of the work here is in lines 6-11, mainly lines 7 and 10. I did a similar casual test removing line 12 and just returning something based on the earlier computations, and it made something like a 30% reduction in test run time (with or without the hardening asserts nopped out). Rich [-- Attachment #2: Type: text/html, Size: 13177 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 5:56 ` baiyang @ 2022-09-20 12:16 ` Rich Felker 2022-09-20 17:21 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-20 12:16 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 01:56:12PM +0800, baiyang wrote: > > // This multi-threaded access to the pagemap is safe for fairly > > // subtle reasons. We basically assume that when an object X is > > // allocated by thread A and deallocated by thread B, there must > > // have been appropriate synchronization in the handoff of object > > // X from thread A to thread B. > > Thanks for your information. > I feel this assumption is very reasonable: you can't have one thread > doing "free(p)" while another thread is accessing the block pointed > to by p without any synchronization mechanism at the same time. That's a correct assumption given that the program's behavior is defined. The problem is that once you assume that, you've completely lost control over what happens when the program's behavior is undefined due to memory lifetime bugs in the application (UAF/DF/etc) and an allocator that does this will necessarily allow bugs in the lifetimes of objects that aren't inherently exploitable on the application level (don't contain pointers that will be written through that might clobber other application state, etc.) to be used to gain control over the allocator state and eventually gain control over the flow of execution through manipulating other allocated objects that are attack vectors. A hardened allocator like mallocng or hardened_malloc does not make assumptions about the validity of data reachable for clobbering through application bugs that haven't already yielded a high level of control to the attacker. Instead, the metadata assumed to be valid is at "secret locations" outside the area where application data is stored that are intended not to be determinable without already having some strong level of control over execution flow. And on top of that, it's cross-validated as much as possible. We also do validation where we can of the "in-band" data that is easily reachable by overflows in application buffers, etc. This both blocks a lot of potential exploits, and catches application bugs that could be exploitable before they turn into exploits, by having them crash on developers'/packagers' systems before they're deployed and getting the root cause investigated and fixed. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 12:16 ` Rich Felker @ 2022-09-20 17:21 ` baiyang 0 siblings, 0 replies; 68+ messages in thread From: baiyang @ 2022-09-20 17:21 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 3402 bytes --] > The problem is that once you assume that, you've completely > lost control over what happens when the program's behavior is > undefined due to memory lifetime bugs in the application (UAF/DF/etc) Yes, but that's exactly the contract between the C language and the programmer: to leave everything in the hands of people for maximum control and efficiency. If we need more "protection", then we can just use other more advanced (and inefficient) languages. Also, because of the enormous flexibility that C gives the programmer (and also hugely destructive when used incorrectly). So I'm afraid that the various checks we do at such a high cost may also be "better than nothing" to prevent bugs? Of course, these tradeoffs are indeed a delicate matter. It's hard to have a definitive answer. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 20:16 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 01:56:12PM +0800, baiyang wrote: > > // This multi-threaded access to the pagemap is safe for fairly > > // subtle reasons. We basically assume that when an object X is > > // allocated by thread A and deallocated by thread B, there must > > // have been appropriate synchronization in the handoff of object > > // X from thread A to thread B. > > Thanks for your information. > I feel this assumption is very reasonable: you can't have one thread > doing "free(p)" while another thread is accessing the block pointed > to by p without any synchronization mechanism at the same time. That's a correct assumption given that the program's behavior is defined. The problem is that once you assume that, you've completely lost control over what happens when the program's behavior is undefined due to memory lifetime bugs in the application (UAF/DF/etc) and an allocator that does this will necessarily allow bugs in the lifetimes of objects that aren't inherently exploitable on the application level (don't contain pointers that will be written through that might clobber other application state, etc.) to be used to gain control over the allocator state and eventually gain control over the flow of execution through manipulating other allocated objects that are attack vectors. A hardened allocator like mallocng or hardened_malloc does not make assumptions about the validity of data reachable for clobbering through application bugs that haven't already yielded a high level of control to the attacker. Instead, the metadata assumed to be valid is at "secret locations" outside the area where application data is stored that are intended not to be determinable without already having some strong level of control over execution flow. And on top of that, it's cross-validated as much as possible. We also do validation where we can of the "in-band" data that is easily reachable by overflows in application buffers, etc. This both blocks a lot of potential exploits, and catches application bugs that could be exploitable before they turn into exploits, by having them crash on developers'/packagers' systems before they're deployed and getting the root cause investigated and fixed. Rich [-- Attachment #2: Type: text/html, Size: 6445 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 13:46 ` Rich Felker 2022-09-19 13:53 ` James Y Knight @ 2022-09-20 8:33 ` Florian Weimer 2022-09-20 13:54 ` Siddhesh Poyarekar 1 sibling, 1 reply; 68+ messages in thread From: Florian Weimer @ 2022-09-20 8:33 UTC (permalink / raw) To: Rich Felker; +Cc: baiyang, musl, Siddhesh Poyarekar * Rich Felker: > On Mon, Sep 19, 2022 at 02:36:41PM +0200, Florian Weimer wrote: >> * Szabolcs Nagy: >> >> > unlike musl those implementations don't return exact size nor have the >> > same security and memory fragmentation guarantees, so bad comparision. >> > >> > tcmalloc: >> > // Returns the actual number N of bytes reserved by tcmalloc for the pointer >> > // p. This number may be equal to or greater than the number of bytes >> > // requested when p was allocated. >> > // >> > // This function is just useful for statistics collection. The client must >> > // *not* read or write from the extra bytes that are indicated by this call. >> > >> > jemalloc: >> > <para>The <function>malloc_usable_size()</function> function >> > returns the usable size of the allocation pointed to by >> > <parameter>ptr</parameter>. The return value may be larger than the size >> > that was requested during allocation. The >> > <function>malloc_usable_size()</function> function is not a >> > mechanism for in-place <function>realloc()</function>; rather >> > it is provided solely as a tool for introspection purposes. Any >> > discrepancy between the requested allocation size and the size reported >> > by <function>malloc_usable_size()</function> should not be >> > depended on, since such behavior is entirely implementation-dependent. >> >> These implementations are buggy or at least mis-documented. The >> interface contract is clearly that for that particular object, the extra >> bytes in the allocation are available for reading and writing. It is >> not guaranteed that the allocator will always provide the same number of >> extra bytes for the same requested size, but they must be there for the >> allocation being examined. It's even in the name of the function! > > I'm not sure I understand what you're saying, but the core problem > that really can't be solved is potential discrepancy between the > malloc implementation's idea of usable and the compiler's. For > example: > > char *p = malloc(1); > if (malloc_usable_size(p)>1) p[1] = 42; > > will cause a compiler that's actively detecting UB to abort the > program when malloc_usable_size returns a value larger than 1. The compiler needs to treat malloc_usable_size similar to realloc and just the size information for the buffer based on the return value from malloc_usable_size. This is admittedly harder to do than a comparable analysis for realloc if the compiler interprets the standard in such a way that after a successful realloc, any access to the original pointer value is undefined. malloc_usable_size is not actually *that* useful with allocators that do not have strict size classes because they do not over-allocate that much. For these allocators, it may be possible to increase the size of allocation significantly without moving it, but that is not reflected in the return value of malloc_usable_size at all. Thanks, Florian ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 8:33 ` Florian Weimer @ 2022-09-20 13:54 ` Siddhesh Poyarekar 2022-09-20 16:59 ` James Y Knight ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Siddhesh Poyarekar @ 2022-09-20 13:54 UTC (permalink / raw) To: Florian Weimer; +Cc: Rich Felker, baiyang, musl On Tue, Sep 20, 2022 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote: > The compiler needs to treat malloc_usable_size similar to realloc and > just the size information for the buffer based on the return value from > malloc_usable_size. This is admittedly harder to do than a comparable > analysis for realloc if the compiler interprets the standard in such a > way that after a successful realloc, any access to the original pointer > value is undefined. > > malloc_usable_size is not actually *that* useful with allocators that do > not have strict size classes because they do not over-allocate that > much. For these allocators, it may be possible to increase the size of > allocation significantly without moving it, but that is not reflected in > the return value of malloc_usable_size at all. So the glibc manual does not document malloc_usable_size semantics (which is weird since it is, well, a GNU extension!) so the only reference users have is the man page. The man page already discourages use of malloc_usable_size to write beyond the allocated size in the NOTES section: The value returned by malloc_usable_size() may be greater than the requested size of the allocation because of alignment and minimum size constraints. Although the excess bytes can be overwritten by the application without ill effects, this is not good programming practice: the number of excess bytes in an allocation depends on the underlying implementation. Adding support for something that's already declared as bad programming practice seems like a step backwards. Instead, I hope we find a way to discourage active use of malloc_usable_size more strongly. At least based on the systemd experience, the problem they try to solve is that of glibc realloc being too slow for paths where the reallocation should return the same block and that should be easy to special-case. Is there any other valid reason to use malloc_usable_size instead of simply using realloc? Sid ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 13:54 ` Siddhesh Poyarekar @ 2022-09-20 16:59 ` James Y Knight 2022-09-20 17:34 ` Szabolcs Nagy 2022-09-20 17:39 ` baiyang 2022-09-20 17:28 ` baiyang 2022-10-10 14:13 ` Florian Weimer 2 siblings, 2 replies; 68+ messages in thread From: James Y Knight @ 2022-09-20 16:59 UTC (permalink / raw) To: musl; +Cc: Florian Weimer, Rich Felker, baiyang [-- Attachment #1: Type: text/plain, Size: 2249 bytes --] On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com> wrote: > Adding support for something that's already declared as bad > programming practice seems like a step backwards. Instead, I hope we > find a way to discourage active use of malloc_usable_size more > strongly. BTW, if folks aren't aware, there is already work on the C++ side to expose an API which lets you request a heap allocation of _at least_ the given size, which rounds the actual size up in whatever way the allocator likes, and returns the pointer and actual size allocated. With this API, you declare an explicit intent that all of the memory -- up to the returned size -- is valid to use without needing to go back to the allocator to ask for more. The proposal is still making its way through the standardization process, but hopefully it'll make it into the next version of C++ after C++23. (Of course, that's not a sure thing until it happens.) Here's the doc, with more rationale/etc: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html Also, as noted in the doc, jemalloc experimentally implemented this functionality in its non-standard API, via a function it called "smallocx" -- though jemalloc hides the API so it can't be used by default. The API is effectively: typedef struct { void *ptr; size_t size; } smallocx_return_t; smallocx_return_t smallocx(size_t size, int flags); https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414 (That's consistent with jemalloc's other non-standard APIs, which stick alignment/etc into a "flags" argument, but probably not suitable for a more-standardized cross-implementation API) tcmalloc implements similar functionality, as well, with family of functions named "tcmalloc_size_returning_operator_new": https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549 which of course also isn't a suitable API to support cross-implementation. If someone wants to push forward this area, IMO, it would be really great to have an API exposing this functionality designed to be implemented in a common way across libc malloc implementations -- and eventually added to POSIX or C. [-- Attachment #2: Type: text/html, Size: 3159 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 16:59 ` James Y Knight @ 2022-09-20 17:34 ` Szabolcs Nagy 2022-09-20 19:53 ` James Y Knight 2022-09-24 8:55 ` Fangrui Song 2022-09-20 17:39 ` baiyang 1 sibling, 2 replies; 68+ messages in thread From: Szabolcs Nagy @ 2022-09-20 17:34 UTC (permalink / raw) To: James Y Knight; +Cc: musl, Florian Weimer, Rich Felker, baiyang * James Y Knight <jyknight@google.com> [2022-09-20 12:59:00 -0400]: > On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com> > wrote: > > > Adding support for something that's already declared as bad > > programming practice seems like a step backwards. Instead, I hope we > > find a way to discourage active use of malloc_usable_size more > > strongly. > > > BTW, if folks aren't aware, there is already work on the C++ side to expose > an API which lets you request a heap allocation of _at least_ the given > size, which rounds the actual size up in whatever way the allocator likes, > and returns the pointer and actual size allocated. With this API, you > declare an explicit intent that all of the memory -- up to the returned > size -- is valid to use without needing to go back to the allocator to ask > for more. > > The proposal is still making its way through the standardization process, > but hopefully it'll make it into the next version of C++ after C++23. (Of > course, that's not a sure thing until it happens.) Here's the doc, with > more rationale/etc: > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html this does not seem to discuss how existing applications that override new() would cope with this. nor how existing implementations on top of c allocators would implement it (given that we just agreed that malloc_usable_size is not suitable for such use). nor how existing allocator tooling (interposers, profilers) would handle the new interface. > > Also, as noted in the doc, jemalloc experimentally implemented this > functionality in its non-standard API, via a function it called "smallocx" > -- though jemalloc hides the API so it can't be used by default. The API is > effectively: > typedef struct { void *ptr; size_t size; } smallocx_return_t; > smallocx_return_t smallocx(size_t size, int flags); > https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414 > (That's consistent with jemalloc's other non-standard APIs, which stick > alignment/etc into a "flags" argument, but probably not suitable for a > more-standardized cross-implementation API) > > tcmalloc implements similar functionality, as well, with family of > functions named "tcmalloc_size_returning_operator_new": so there are already incompatible c apis, which means this should not be considered a viable proposal at this point. > > https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549 > which of course also isn't a suitable API to support cross-implementation. > > If someone wants to push forward this area, IMO, it would be really great > to have an API exposing this functionality designed to be implemented in a > common way across libc malloc implementations -- and eventually added to > POSIX or C. this is done the wrong way around. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 17:34 ` Szabolcs Nagy @ 2022-09-20 19:53 ` James Y Knight 2022-09-24 8:55 ` Fangrui Song 1 sibling, 0 replies; 68+ messages in thread From: James Y Knight @ 2022-09-20 19:53 UTC (permalink / raw) To: James Y Knight, musl, Florian Weimer, Rich Felker, baiyang [-- Attachment #1: Type: text/plain, Size: 2048 bytes --] On Tue, Sep 20, 2022 at 1:34 PM Szabolcs Nagy <nsz@port70.net> wrote: > this does not seem to discuss how existing applications > that override new() would cope with this. I'm certain this isn't the right forum to discuss the C++-specific parts of the proposal, nor am I the right person to do so even if it were. I brought it up to potentially inform the discussion of C library APIs, not to debate whether the C++ proposal itself is ready (it's clearly not: it didn't make it into C++23). nor how existing implementations on top of c allocators > would implement it (given that we just agreed that > malloc_usable_size is not suitable for such use). As the doc says, implementations should use the trivial implementation which returns exactly the requested size (e.g. `return {::operator new(size), size};`), if no better option is available in the underlying allocator. nor how existing allocator tooling (interposers, profilers) > would handle the new interface. > User replacement of the C++ global allocators is supported in the standard, but interposition of malloc is only an implementation-defined extension. In practice (and as documented by glibc), a malloc interposer must interpose upon the complete set of malloc-APIs which are ever used anywhere in the binary. (Of course, the base malloc implementation CAN harden itself against incomplete interposition if it wishes -- as musl does to support interposers that fail to define aligned_alloc.) > If someone wants to push forward this area, IMO, it would be really great > > to have an API exposing this functionality designed to be implemented in > a > > common way across libc malloc implementations -- and eventually added to > > POSIX or C. this is done the wrong way around. Eh. Standardization often goes better when there is already concrete implementation experience, and some level of agreement, rather than doing brand new design. That said, I didn't intend to prescribe how someone else should best go about making this happen, if indeed anyone wishes to do so. [-- Attachment #2: Type: text/html, Size: 3316 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 17:34 ` Szabolcs Nagy 2022-09-20 19:53 ` James Y Knight @ 2022-09-24 8:55 ` Fangrui Song 1 sibling, 0 replies; 68+ messages in thread From: Fangrui Song @ 2022-09-24 8:55 UTC (permalink / raw) To: James Y Knight, musl, Florian Weimer, Rich Felker, baiyang On 2022-09-20, Szabolcs Nagy wrote: >* James Y Knight <jyknight@google.com> [2022-09-20 12:59:00 -0400]: > >> On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com> >> wrote: >> >> > Adding support for something that's already declared as bad >> > programming practice seems like a step backwards. Instead, I hope we >> > find a way to discourage active use of malloc_usable_size more >> > strongly. >> >> >> BTW, if folks aren't aware, there is already work on the C++ side to expose >> an API which lets you request a heap allocation of _at least_ the given >> size, which rounds the actual size up in whatever way the allocator likes, >> and returns the pointer and actual size allocated. With this API, you >> declare an explicit intent that all of the memory -- up to the returned >> size -- is valid to use without needing to go back to the allocator to ask >> for more. >> >> The proposal is still making its way through the standardization process, >> but hopefully it'll make it into the next version of C++ after C++23. (Of >> course, that's not a sure thing until it happens.) Here's the doc, with >> more rationale/etc: >> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html > >this does not seem to discuss how existing applications >that override new() would cope with this. > >nor how existing implementations on top of c allocators >would implement it (given that we just agreed that >malloc_usable_size is not suitable for such use). > >nor how existing allocator tooling (interposers, profilers) >would handle the new interface. > >> >> Also, as noted in the doc, jemalloc experimentally implemented this >> functionality in its non-standard API, via a function it called "smallocx" >> -- though jemalloc hides the API so it can't be used by default. The API is >> effectively: >> typedef struct { void *ptr; size_t size; } smallocx_return_t; >> smallocx_return_t smallocx(size_t size, int flags); >> https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414 >> (That's consistent with jemalloc's other non-standard APIs, which stick >> alignment/etc into a "flags" argument, but probably not suitable for a >> more-standardized cross-implementation API) >> >> tcmalloc implements similar functionality, as well, with family of >> functions named "tcmalloc_size_returning_operator_new": > >so there are already incompatible c apis, which means this >should not be considered a viable proposal at this point. Small addition: https://wg21.link/P0401R6 (allocate_at_least) has made it into C++23. https://reviews.llvm.org/D122877 libc++ has implemented it in the trivial way that just returns the user-requested size. >> https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549 >> which of course also isn't a suitable API to support cross-implementation. >> >> If someone wants to push forward this area, IMO, it would be really great >> to have an API exposing this functionality designed to be implemented in a >> common way across libc malloc implementations -- and eventually added to >> POSIX or C. > >this is done the wrong way around. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 16:59 ` James Y Knight 2022-09-20 17:34 ` Szabolcs Nagy @ 2022-09-20 17:39 ` baiyang 2022-09-20 18:12 ` Quentin Rameau 1 sibling, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 17:39 UTC (permalink / raw) To: James Y Knight, musl; +Cc: Florian Weimer, Rich Felker [-- Attachment #1: Type: text/plain, Size: 3426 bytes --] > tcmalloc implements similar functionality, as well, with family of functions named "tcmalloc_size_returning_operator_new" This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**. As in our previous example, suppose we "void* p malloc(200KB)" and "malloc_usable_size(p)" returns 256KB. For allocators such as tcmalloc, if we do "realloc(p, 300KB)", it will actually execute "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at this time, the actual business of the application layer often only needs to copy a small amount of content, such as the first 2KB of data. Therefore, it would be great if there were additional parameters to inform realloc to just execute malloc(300KB)+memcpy(**2KB**)+free(256KB) when degenerating back to malloc+memcpy+free. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: James Y Knight Date: 2022-09-21 00:59 To: musl CC: Florian Weimer; Rich Felker; baiyang Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 9:58 AM Siddhesh Poyarekar <siddhesh@redhat.com> wrote: Adding support for something that's already declared as bad programming practice seems like a step backwards. Instead, I hope we find a way to discourage active use of malloc_usable_size more strongly. BTW, if folks aren't aware, there is already work on the C++ side to expose an API which lets you request a heap allocation of _at least_ the given size, which rounds the actual size up in whatever way the allocator likes, and returns the pointer and actual size allocated. With this API, you declare an explicit intent that all of the memory -- up to the returned size -- is valid to use without needing to go back to the allocator to ask for more. The proposal is still making its way through the standardization process, but hopefully it'll make it into the next version of C++ after C++23. (Of course, that's not a sure thing until it happens.) Here's the doc, with more rationale/etc: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p0901r9.html Also, as noted in the doc, jemalloc experimentally implemented this functionality in its non-standard API, via a function it called "smallocx" -- though jemalloc hides the API so it can't be used by default. The API is effectively: typedef struct { void *ptr; size_t size; } smallocx_return_t; smallocx_return_t smallocx(size_t size, int flags); https://github.com/jemalloc/jemalloc/blob/a0734fd6ee326cd2059edbe4bca7092988a63684/src/jemalloc.c#L3414 (That's consistent with jemalloc's other non-standard APIs, which stick alignment/etc into a "flags" argument, but probably not suitable for a more-standardized cross-implementation API) tcmalloc implements similar functionality, as well, with family of functions named "tcmalloc_size_returning_operator_new": https://github.com/google/tcmalloc/blob/267aa2ec2817ab9d09b3fbb65ecb90193dd4348e/tcmalloc/malloc_extension.h#L549 which of course also isn't a suitable API to support cross-implementation. If someone wants to push forward this area, IMO, it would be really great to have an API exposing this functionality designed to be implemented in a common way across libc malloc implementations -- and eventually added to POSIX or C. [-- Attachment #2: Type: text/html, Size: 6762 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 17:39 ` baiyang @ 2022-09-20 18:12 ` Quentin Rameau 2022-09-20 18:19 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: Quentin Rameau @ 2022-09-20 18:12 UTC (permalink / raw) To: musl; +Cc: Florian Weimer, Rich Felker > This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**. > > As in our previous example, suppose we "void* p malloc(200KB)" and "malloc_usable_size(p)" returns 256KB. For allocators such as tcmalloc, if we do "realloc(p, 300KB)", it will actually execute "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at this time, the actual business of the application layer often only needs to copy a small amount of content, such as the first 2KB of data. So what you're actually trying to do is more clear now. And this is something that you should do on the “application layer”, not expect the libc to magically taking care of this. If your application already knows it only needs to keep back only a few amount of data, do this in a wrapper, malloc(300KB), memcpy(2KB), free old memory. If on the other hand it know it actually wants to reuse the same data but in a more vast contiguous memory, use realloc. > Therefore, it would be great if there were additional parameters to inform realloc to just execute malloc(300KB)+memcpy(**2KB**)+free(256KB) when degenerating back to malloc+memcpy+free. That's what your wrapper would be. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:12 ` Quentin Rameau @ 2022-09-20 18:19 ` Rich Felker 2022-09-20 18:26 ` Alexander Monakov 2022-09-21 10:15 ` [musl] " 王志强 0 siblings, 2 replies; 68+ messages in thread From: Rich Felker @ 2022-09-20 18:19 UTC (permalink / raw) To: Quentin Rameau; +Cc: musl, Florian Weimer On Tue, Sep 20, 2022 at 08:12:44PM +0200, Quentin Rameau wrote: > > This is a good thing, we noticed it before. But we actually need a realloc that **can specify the copy range**. > > > > As in our previous example, suppose we "void* p malloc(200KB)" and > > "malloc_usable_size(p)" returns 256KB. For allocators such as > > tcmalloc, if we do "realloc(p, 300KB)", it will actually execute > > "malloc(300KB)+memcpy(**256KB**)+free(256KB)". But at this time, > > the actual business of the application layer often only needs to > > copy a small amount of content, such as the first 2KB of data. > > So what you're actually trying to do is more clear now. > And this is something that you should do on the “application layer”, > not expect the libc to magically taking care of this. Exactly. This can be done entirely at the application layer just by keeping track of the size you allocated. In the above example, the number 256 kB is a red herring. Yes the "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful when you only want to preserve the first 2K, and you can make the decision that it would be wasteful, and that you instead just want to allocate a new buffer yourself and memcpy 2K, just by knowing the original 200KB, without any knowledge of malloc_usable_size. As a bonus, this will be even faster than the fastest possible malloc_usable_size implementation. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:19 ` Rich Felker @ 2022-09-20 18:26 ` Alexander Monakov 2022-09-20 18:35 ` baiyang 2022-09-20 18:37 ` Quentin Rameau 2022-09-21 10:15 ` [musl] " 王志强 1 sibling, 2 replies; 68+ messages in thread From: Alexander Monakov @ 2022-09-20 18:26 UTC (permalink / raw) To: musl; +Cc: Quentin Rameau, Florian Weimer On Tue, 20 Sep 2022, Rich Felker wrote: > Exactly. This can be done entirely at the application layer just by > keeping track of the size you allocated. In the above example, the > number 256 kB is a red herring. Yes the > "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the > "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful > when you only want to preserve the first 2K, and you can make the > decision that it would be wasteful, and that you instead just want to > allocate a new buffer yourself and memcpy 2K, just by knowing the > original 200KB, without any knowledge of malloc_usable_size. They want to know if realloc will resize the allocation in-place so the internal memcpy will not happen. AIUI, what they really need is not "usable_size", but "cost estimation for resizing allocation at pointer P to size S". Which I believe they try to deduce from malloc_usable_size. Alexander ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:26 ` Alexander Monakov @ 2022-09-20 18:35 ` baiyang 2022-09-20 20:33 ` Gabriel Ravier 2022-09-20 18:37 ` Quentin Rameau 1 sibling, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 18:35 UTC (permalink / raw) To: musl; +Cc: Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 2096 bytes --] > So what you're actually trying to do is more clear now. > And this is something that you should do on the “application layer”, > not expect the libc to magically taking care of this. Yes, we just do this at the application layer with the help of APIs such as malloc_usable_size. You can look at the previous emails, isn't that what we're discussing? > They want to know if realloc will resize the allocation in-place so > the internal memcpy will not happen. > AIUI, what they really need is not "usable_size", but "cost estimation > for resizing allocation at pointer P to size S". Which I believe they > try to deduce from malloc_usable_size. Yes, from malloc_usable_size, the size that the application layer actually needs to copy, and other parameters like the OS, allocator, etc. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Alexander Monakov Date: 2022-09-21 02:26 To: musl CC: Quentin Rameau; Florian Weimer Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, 20 Sep 2022, Rich Felker wrote: > Exactly. This can be done entirely at the application layer just by > keeping track of the size you allocated. In the above example, the > number 256 kB is a red herring. Yes the > "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the > "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful > when you only want to preserve the first 2K, and you can make the > decision that it would be wasteful, and that you instead just want to > allocate a new buffer yourself and memcpy 2K, just by knowing the > original 200KB, without any knowledge of malloc_usable_size. They want to know if realloc will resize the allocation in-place so the internal memcpy will not happen. AIUI, what they really need is not "usable_size", but "cost estimation for resizing allocation at pointer P to size S". Which I believe they try to deduce from malloc_usable_size. Alexander [-- Attachment #2: Type: text/html, Size: 4559 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:35 ` baiyang @ 2022-09-20 20:33 ` Gabriel Ravier 2022-09-20 20:45 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Gabriel Ravier @ 2022-09-20 20:33 UTC (permalink / raw) To: musl, baiyang; +Cc: Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 1228 bytes --] On 9/20/22 20:35, baiyang wrote: > > So what you're actually trying to do is more clear now. > > And this is something that you should do on the “application layer”, > > not expect the libc to magically taking care of this. > > Yes, we just do this at the application layer with the help of APIs > such as malloc_usable_size. You can look at the previous emails, isn't > that what we're discussing? > > > They want to know if realloc will resize the allocation in-place so > > the internal memcpy will not happen. > > AIUI, what they really need is not "usable_size", but "cost estimation > > for resizing allocation at pointer P to size S". Which I believe they > > try to deduce from malloc_usable_size. > > Yes, from malloc_usable_size, the size that the application layer > actually needs to copy, and other parameters like the OS, allocator, etc. > At this point it feels more and more like malloc basically can't fulfill your needs unless every implementation out there starts adding specialized methods for your application. In other words, it might be more productive for you to simply bring in a more agreeable allocator, instead of trying to get every single allocator out there to bow down to your needs. [-- Attachment #2: Type: text/html, Size: 2780 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 20:33 ` Gabriel Ravier @ 2022-09-20 20:45 ` baiyang 2022-09-21 8:42 ` NRK 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 20:45 UTC (permalink / raw) To: Gabriel Ravier, musl; +Cc: Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 2390 bytes --] > At this point it feels more and more like malloc basically can't fulfill your needs unless every implementation out there starts adding specialized methods for your application. In other words, it might be more productive for you to simply bring in a more agreeable allocator, instead of trying to get every single allocator out there to bow down to your needs. Yes, we will adapt as many allocators as possible to make our software more portable and adaptable. Frankly, so far, we've only encountered malloc_usable_size performance issues with musl's mallocng (and, as we discussed earlier, this performance issue affects both realloc and free). Of course, we also respect and understand the author's considerations and trade-offs, and have avoided calling malloc_usable_size for musl's mallocng in our software. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Gabriel Ravier Date: 2022-09-21 04:33 To: musl; baiyang CC: Quentin Rameau; Florian Weimer Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On 9/20/22 20:35, baiyang wrote: > So what you're actually trying to do is more clear now. > And this is something that you should do on the “application layer”, > not expect the libc to magically taking care of this. Yes, we just do this at the application layer with the help of APIs such as malloc_usable_size. You can look at the previous emails, isn't that what we're discussing? > They want to know if realloc will resize the allocation in-place so > the internal memcpy will not happen. > AIUI, what they really need is not "usable_size", but "cost estimation > for resizing allocation at pointer P to size S". Which I believe they > try to deduce from malloc_usable_size. Yes, from malloc_usable_size, the size that the application layer actually needs to copy, and other parameters like the OS, allocator, etc. At this point it feels more and more like malloc basically can't fulfill your needs unless every implementation out there starts adding specialized methods for your application. In other words, it might be more productive for you to simply bring in a more agreeable allocator, instead of trying to get every single allocator out there to bow down to your needs. [-- Attachment #2: Type: text/html, Size: 5654 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 20:45 ` baiyang @ 2022-09-21 8:42 ` NRK 0 siblings, 0 replies; 68+ messages in thread From: NRK @ 2022-09-21 8:42 UTC (permalink / raw) To: musl On Wed, Sep 21, 2022 at 04:45:59AM +0800, baiyang wrote: > Yes, we will adapt as many allocators as possible to make our software > more portable and adaptable. > > Of course, we also respect and understand the author's considerations > and trade-offs, and have avoided calling malloc_usable_size for musl's > mallocng in our software. Here's a better idea, malloc (and malloc_usable_size) cannot be slow if you never call it; so just skip it all together and grab the needed pages straight from the kernel (mmap(2) on linux/posix and on windows I think it's VirtualAlloc). The malloc API was designed for ease of use, not performance. And it's evident from the fact that user has no control over malloc's metadata. And what you're doing with malloc_usable_size() is not an "optimization", it's a heuristic at best and hack that'll actually *pessimize* at worse. Proof? This very thread. Also malloc_usable_size() does not tell you anything about weather realloc will grow in place (or avoid copy entirely via virtual address space shenanigans). If you need such fine grained control then you're using the wrong API, you either need to manage your memory directly or find an allocator library which *actually* (as in documented in the API contract) offers such control rather than relying on unreliable heuristics. - NRK ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:26 ` Alexander Monakov 2022-09-20 18:35 ` baiyang @ 2022-09-20 18:37 ` Quentin Rameau 1 sibling, 0 replies; 68+ messages in thread From: Quentin Rameau @ 2022-09-20 18:37 UTC (permalink / raw) To: musl Hi Alexander, > > Exactly. This can be done entirely at the application layer just by > > keeping track of the size you allocated. In the above example, the > > number 256 kB is a red herring. Yes the > > "malloc(300KB)+memcpy(256KB)+free(256KB)" is wasteful, but the > > "malloc(300KB)+memcpy(200KB)+free(200KB)" would be comparably wasteful > > when you only want to preserve the first 2K, and you can make the > > decision that it would be wasteful, and that you instead just want to > > allocate a new buffer yourself and memcpy 2K, just by knowing the > > original 200KB, without any knowledge of malloc_usable_size. > > They want to know if realloc will resize the allocation in-place so > the internal memcpy will not happen. > > AIUI, what they really need is not "usable_size", but "cost estimation > for resizing allocation at pointer P to size S". Which I believe they > try to deduce from malloc_usable_size. I see. But then I doubt this is something you can reliably expect from any implementation, as at best, one would just have to assume that the target allocator works in one way and wouldn't change. This will result in a code tightly bound to a specific alocator because of the assumptions made about the library internals. This is not really the kind of contract one makes with the POSIX/C library in a portable way. So if the request really is to press every libc implementation into adopting one user's (or a few) self-made requirements, that still doesn't sound like a constructive conversation here. At this point if ones really needs micro-management of their memory, they should just implement their own allocator. ^ permalink raw reply [flat|nested] 68+ messages in thread
* [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 18:19 ` Rich Felker 2022-09-20 18:26 ` Alexander Monakov @ 2022-09-21 10:15 ` 王志强 2022-09-21 16:11 ` [musl] " 王志强 2022-09-21 17:15 ` [musl] " Rich Felker 1 sibling, 2 replies; 68+ messages in thread From: 王志强 @ 2022-09-21 10:15 UTC (permalink / raw) To: musl; +Cc: Quentin Rameau, Florian Weimer, dalias [-- Attachment #1: Type: text/plain, Size: 2549 bytes --] Hi Rich, I am quite interested into the topic, and made a comparation between glibc and musl with following code: #define MAXF 4096 void* tobefree[MAXF]; int main() { long long i; int v, k; size_t s, c=0; char *p; for (i=0; i<100000000L; i++) { v = rand(); s = ((v%256)+1)*1024; p = (char*) malloc(s); p[1023]=0; if (c>=MAXF) { k = v%c; free(tobefree[k]); tobefree[k]=tobefree[--c]; } tobefree[c++]=p; } return 0; } ``` The results show signaficant difference. With glibc, (running within a debian docker image) # gcc -o m.debian -O0 app_malloc.c # time ./m.debian real 0m37.529s user 0m36.677s sys 0m0.771s With musl, (runnign within a alpine3.15 docker image) # gcc -o m.alpine -O0 app_malloc.c # time ./m.alpine real 6m 30.51s user 1m 36.67s sys 4m 53.31s musl seems spend way too much time within kernel, while glibc hold most work within userspace. I used perf_event_open to profile those programs: musl profiling(total 302899 samples) shows that those "malloc/free" sequence spend lots of time dealing with pagefault/munmap/madvise/mmap munmap(30.858% 93469/302899) _init?(22.583% 68404/302899) aligned_alloc?(89.290% 61078/68404) asm_exc_page_fault(45.961% 28072/61078) main(9.001% 6157/68404) asm_exc_page_fault(29.170% 1796/6157) rand(1.266% 866/68404) aligned_alloc?(20.437% 61904/302899) asm_exc_page_fault(56.038% 34690/61904) madvise(13.275% 40209/302899) mmap64(11.125% 33698/302899) But glibc profiling (total 29072 samples) is way much lighter, pagefault is the most cost while glibc spend significat time on "free" pthread_attr_setschedparam?(82.021% 23845/29072) asm_exc_page_fault(1.657% 395/23845) _dl_catch_error?(16.714% 4859/29072)__libc_start_main(100.000% 4859/4859) cfree(58.839% 2859/4859) main(31.138% 1513/4859) asm_exc_page_fault(2.115% 32/1513) pthread_attr_setschedparam?(3.725% 181/4859) random(2.099% 102/4859) random_r(1.832% 89/4859) __libc_malloc(1.420% 69/4859) It seems to be me, glibc make lots of uasage of cache of kernel memory and avoid lots of pagefault and syscalls. Is this performance difference should concern realworld applications? On average, musl actual spend about 3~4ns per malloc/free, which is quite acceptable in realworld applications, I think. (Seems to me, that the performance difference has nothing to do with malloc_usable_size, which may be indeed just a speculative guess without any base) David Wang [-- Attachment #2: Type: text/html, Size: 6435 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* [musl] Re:[musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-21 10:15 ` [musl] " 王志强 @ 2022-09-21 16:11 ` 王志强 2022-09-21 17:15 ` [musl] " Rich Felker 1 sibling, 0 replies; 68+ messages in thread From: 王志强 @ 2022-09-21 16:11 UTC (permalink / raw) To: musl; +Cc: Quentin Rameau, Florian Weimer, dalias [-- Attachment #1: Type: text/plain, Size: 1067 bytes --] At 2022-09-21 18:15:02, "王志强" <00107082@163.com> wrote: But glibc profiling (total 29072 samples) is way much lighter, pagefault is the most cost while glibc spend significat time on "free" pthread_attr_setschedparam?(82.021% 23845/29072) asm_exc_page_fault(1.657% 395/23845) _dl_catch_error?(16.714% 4859/29072)__libc_start_main(100.000% 4859/4859) cfree(58.839% 2859/4859) main(31.138% 1513/4859) asm_exc_page_fault(2.115% 32/1513) pthread_attr_setschedparam?(3.725% 181/4859) random(2.099% 102/4859) random_r(1.832% 89/4859) __libc_malloc(1.420% 69/4859) update: glibc spend most of its time within malloc, 82% samples, not the pagefault routine (only 1%), I misread the report in my last mail. __libc_malloc?(82.195% 20700/25184) asm_exc_page_fault(1.589% 329/20700) _dl_catch_error??(16.717% 4210/25184)__libc_start_main(100.000% 4210/4210) cfree(56.508% 2379/4210) main(32.660% 1375/4210) asm_exc_page_fault(2.327% 32/1375) __libc_malloc?(4.418% 186/4210) random(2.375% 100/4210) random_r(1.805% 76/4210) __libc_malloc(1.116% 47/4210) David [-- Attachment #2: Type: text/html, Size: 3852 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-21 10:15 ` [musl] " 王志强 2022-09-21 16:11 ` [musl] " 王志强 @ 2022-09-21 17:15 ` Rich Felker 2022-09-21 17:58 ` Rich Felker 1 sibling, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-21 17:15 UTC (permalink / raw) To: 王志强; +Cc: musl, Quentin Rameau, Florian Weimer On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote: > Hi Rich, > > > > I am quite interested into the topic, and made a comparation between glibc and musl with following code: > #define MAXF 4096 > void* tobefree[MAXF]; > int main() { > long long i; > int v, k; > size_t s, c=0; > char *p; > for (i=0; i<100000000L; i++) { > v = rand(); > s = ((v%256)+1)*1024; > p = (char*) malloc(s); > p[1023]=0; > if (c>=MAXF) { > k = v%c; > free(tobefree[k]); > tobefree[k]=tobefree[--c]; > } > tobefree[c++]=p; > } > return 0; > } > ``` > > The results show signaficant difference. > With glibc, (running within a debian docker image) > # gcc -o m.debian -O0 app_malloc.c > > # time ./m.debian > real 0m37.529s > user 0m36.677s > sys 0m0.771s > > With musl, (runnign within a alpine3.15 docker image) > > # gcc -o m.alpine -O0 app_malloc.c > > # time ./m.alpine > real 6m 30.51s > user 1m 36.67s > sys 4m 53.31s > > > > musl seems spend way too much time within kernel, while glibc hold most work within userspace. > I used perf_event_open to profile those programs: > musl profiling(total 302899 samples) shows that those "malloc/free" sequence spend lots of time dealing with pagefault/munmap/madvise/mmap > > munmap(30.858% 93469/302899) > _init?(22.583% 68404/302899) > aligned_alloc?(89.290% 61078/68404) > asm_exc_page_fault(45.961% 28072/61078) > main(9.001% 6157/68404) > asm_exc_page_fault(29.170% 1796/6157) > rand(1.266% 866/68404) > aligned_alloc?(20.437% 61904/302899) > asm_exc_page_fault(56.038% 34690/61904) > madvise(13.275% 40209/302899) > mmap64(11.125% 33698/302899) > > > But glibc profiling (total 29072 samples) is way much lighter, pagefault is the most cost while glibc spend significat time on "free" > > > > pthread_attr_setschedparam?(82.021% 23845/29072) > asm_exc_page_fault(1.657% 395/23845) > _dl_catch_error?(16.714% 4859/29072)__libc_start_main(100.000% 4859/4859) > cfree(58.839% 2859/4859) > main(31.138% 1513/4859) > asm_exc_page_fault(2.115% 32/1513) > pthread_attr_setschedparam?(3.725% 181/4859) > random(2.099% 102/4859) > random_r(1.832% 89/4859) > __libc_malloc(1.420% 69/4859) > It seems to be me, glibc make lots of uasage of cache of kernel > memory and avoid lots of pagefault and syscalls. > Is this performance difference should concern realworld > applications? On average, musl actual spend about 3~4ns per > malloc/free, which is quite acceptable in realworld applications, I > think. > > > > (Seems to me, that the performance difference has nothing to do with > malloc_usable_size, which may be indeed just a speculative guess > without any base) Indeed this has nothing to do with it. What you're seeing is just that musl/mallocng return freed memory, and glibc, basically, doesn't (modulo the special case of large contiguous free block at 'top' of heap). This inherently has a time cost. mallocng does make significant efforts to avoid hammering mmap/munmap under repeated malloc/free, at least in cases where it can reasonably be deemed to matter. However, this is best-effort, and always a tradeoff on (potential) large unwanted memory usage vs performance. More on this later. Your test case, with the completely random size distribution across various large sizes, is likely a worst case. The mean size you're allocating is 128k, which is the threshold for direct mmap/munmap of each allocation, so at least half of the allocations you're making can *never* be reused, and will always be immediately unmapped on free. It might be interesting to change the scaling factor from 1k to 256 bytes so that basically all of the allocation sizes are in the malloc-managed range. Now, I mentioned above "when it can reasonably be deemed to matter". One of the assumptions mallocng makes for deciding where to risk larger memory usage for the sake of performance is that you're *actually going to use* the memory you're allocating. Yes, 6min might seem like a lot of time to do nothing, but that "doing nothing" was churning through allocating 12 TB of memory. How much time would you have spent *actually processing* 12 TB of data (and not as flat linear data, but avg-128kB chunks), and what percentage of that time would the 6min in malloc have been? Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-21 17:15 ` [musl] " Rich Felker @ 2022-09-21 17:58 ` Rich Felker 2022-09-22 3:34 ` [musl] " 王志强 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-21 17:58 UTC (permalink / raw) To: 王志强; +Cc: musl, Quentin Rameau, Florian Weimer On Wed, Sep 21, 2022 at 01:15:35PM -0400, Rich Felker wrote: > On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote: > > Hi Rich, > > > > > > > > I am quite interested into the topic, and made a comparation between glibc and musl with following code: > > #define MAXF 4096 > > void* tobefree[MAXF]; > > int main() { > > long long i; > > int v, k; > > size_t s, c=0; > > char *p; > > for (i=0; i<100000000L; i++) { > > v = rand(); > > s = ((v%256)+1)*1024; > > p = (char*) malloc(s); > > p[1023]=0; > > if (c>=MAXF) { > > k = v%c; > > free(tobefree[k]); > > tobefree[k]=tobefree[--c]; > > } > > tobefree[c++]=p; > > } > > return 0; > > } > > ``` > > > > The results show signaficant difference. > > With glibc, (running within a debian docker image) > > # gcc -o m.debian -O0 app_malloc.c > > > > # time ./m.debian > > real 0m37.529s > > user 0m36.677s > > sys 0m0.771s > > > > With musl, (runnign within a alpine3.15 docker image) > > > > # gcc -o m.alpine -O0 app_malloc.c > > > > # time ./m.alpine > > real 6m 30.51s > > user 1m 36.67s > > sys 4m 53.31s > > > > > > > > musl seems spend way too much time within kernel, while glibc hold most work within userspace. > > I used perf_event_open to profile those programs: > > musl profiling(total 302899 samples) shows that those "malloc/free" sequence spend lots of time dealing with pagefault/munmap/madvise/mmap > > > > munmap(30.858% 93469/302899) > > _init?(22.583% 68404/302899) > > aligned_alloc?(89.290% 61078/68404) > > asm_exc_page_fault(45.961% 28072/61078) > > main(9.001% 6157/68404) > > asm_exc_page_fault(29.170% 1796/6157) > > rand(1.266% 866/68404) > > aligned_alloc?(20.437% 61904/302899) > > asm_exc_page_fault(56.038% 34690/61904) > > madvise(13.275% 40209/302899) > > mmap64(11.125% 33698/302899) > > > > > > But glibc profiling (total 29072 samples) is way much lighter, pagefault is the most cost while glibc spend significat time on "free" > > > > > > > > pthread_attr_setschedparam?(82.021% 23845/29072) > > asm_exc_page_fault(1.657% 395/23845) > > _dl_catch_error?(16.714% 4859/29072)__libc_start_main(100.000% 4859/4859) > > cfree(58.839% 2859/4859) > > main(31.138% 1513/4859) > > asm_exc_page_fault(2.115% 32/1513) > > pthread_attr_setschedparam?(3.725% 181/4859) > > random(2.099% 102/4859) > > random_r(1.832% 89/4859) > > __libc_malloc(1.420% 69/4859) > > It seems to be me, glibc make lots of uasage of cache of kernel > > memory and avoid lots of pagefault and syscalls. > > Is this performance difference should concern realworld > > applications? On average, musl actual spend about 3~4ns per > > malloc/free, which is quite acceptable in realworld applications, I > > think. > > > > > > > > (Seems to me, that the performance difference has nothing to do with > > malloc_usable_size, which may be indeed just a speculative guess > > without any base) > > Indeed this has nothing to do with it. What you're seeing is just that > musl/mallocng return freed memory, and glibc, basically, doesn't > (modulo the special case of large contiguous free block at 'top' of > heap). This inherently has a time cost. > > mallocng does make significant efforts to avoid hammering mmap/munmap > under repeated malloc/free, at least in cases where it can reasonably > be deemed to matter. However, this is best-effort, and always a > tradeoff on (potential) large unwanted memory usage vs performance. > More on this later. > > Your test case, with the completely random size distribution across > various large sizes, is likely a worst case. The mean size you're > allocating is 128k, which is the threshold for direct mmap/munmap of > each allocation, so at least half of the allocations you're making can > *never* be reused, and will always be immediately unmapped on free. It > might be interesting to change the scaling factor from 1k to 256 bytes > so that basically all of the allocation sizes are in the > malloc-managed range. One observation if this change is made: it looks like at least 70% of the time is spent performing madvise(MADV_FREE), and that a large portion of the rest (just looking at strace) seems to be repeatedly mapping and freeing a 17-page (68k) block, probably because this size happens to be at the boundary of some threshold where bounce protection isn't happening. I think we should look at both of these in more detail, since they both suggest opportunities for large performance improvements at low cost. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* [musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-21 17:58 ` Rich Felker @ 2022-09-22 3:34 ` 王志强 2022-09-22 9:10 ` [musl] " 王志强 0 siblings, 1 reply; 68+ messages in thread From: 王志强 @ 2022-09-22 3:34 UTC (permalink / raw) To: musl, dalias; +Cc: Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 3561 bytes --] Hi Rich, Thanks for your time. Totally agreed that in realworld application, it would take way more time to process that huge bulk of memory, compared with the average 3 microsecond per malloc&free. At 2022-09-22 01:58:17, "Rich Felker" <dalias@libc.org> wrote: > > >> Your test case, with the completely random size distribution across >> various large sizes, is likely a worst case. The mean size you're >> allocating is 128k, which is the threshold for direct mmap/munmap of >> each allocation, so at least half of the allocations you're making can >> *never* be reused, and will always be immediately unmapped on free. It >> might be interesting to change the scaling factor from 1k to 256 bytes >> so that basically all of the allocation sizes are in the >> malloc-managed range. > >One observation if this change is made: it looks like at least 70% of >the time is spent performing madvise(MADV_FREE), and that a large >portion of the rest (just looking at strace) seems to be repeatedly >mapping and freeing a 17-page (68k) block, probably because this size >happens to be at the boundary of some threshold where bounce >protection isn't happening. I think we should look at both of these in >more detail, since they both suggest opportunities for large >performance improvements at low cost. > I have made several profiling, the report indeed show that as the size decreased, performance went up significantly and madvise now take major portion of time, as you suggested, also madviser's portion decrease as size decrease, when average size reach to 2K, madviser was only picked by profiler less than 2%: 1. average 64K(1K~128K) malloc/free # time ./m.alpine real 1m 50.12s user 0m 39.80s sys 1m 10.17s madvise(61.945% 52926/85440) __libc_start_main?(22.158% 18932/85440) malloc_usable_size?(82.870% 15689/18932) asm_exc_page_fault(2.766% 434/15689) main(16.781% 3177/18932) asm_exc_page_fault(2.487% 79/3177) malloc_usable_size?(10.969% 9372/85440) asm_exc_page_fault(6.519% 611/9372) munmap(2.449% 2092/85440) exit?(1.540% 1316/85440) 2. average 32K (1K~64K) malloc/free # time ./m.alpine real 1m 12.89s user 0m 30.62s sys 0m 41.91s madvise(60.835% 34282/56352) __libc_start_main?(27.410% 15446/56352) malloc_usable_size?(78.558% 12134/15446) main(20.996% 3243/15446) malloc_usable_size?(9.354% 5271/56352) exit?(1.888% 1064/56352) 3. average 8K (1K~16K) # time ./m.alpine real 0m 42.35s user 0m 22.94s sys 0m 19.27s madvise(49.338% 16169/32772) __libc_start_main?(36.592% 11992/32772) malloc_usable_size?(79.244% 9503/11992) main(20.480% 2456/11992) malloc_usable_size?(10.921% 3579/32772) exit?(2.591% 849/32772) 4. average 4K (1k~8K) # time ./m.debian real 0m32.477s user 0m31.829s sys 0m0.596s __libc_start_main?(44.474% 9279/20864) malloc_usable_size?(81.410% 7554/9279) main(17.987% 1669/9279) madvise(37.720% 7870/20864) malloc_usable_size?(13.986% 2918/20864) exit?(3.350% 699/20864) 5 average 2K(128B~4096B) (madviser only about 1.7%) # time ./m.alpine real 0m 13.02s user 0m 12.68s sys 0m 0.26s __libc_start_main?(69.538% 6974/10029) malloc_usable_size?(80.786% 5634/6974) main(18.569% 1295/6974) malloc_usable_size?(21.538% 2160/10029) exit?(7.060% 708/10029) madvise(1.715% 172/10029) 6. average 1K (128B~2048B) # time ./m.alpine real 0m 10.75s user 0m 10.68s sys 0m 0.01s __libc_start_main?(72.495% 6012/8293) malloc_usable_size?(76.630% 4607/6012) main(22.904% 1377/6012) malloc_usable_size?(18.823% 1561/8293) exit?(8.610% 714/8293) David [-- Attachment #2: Type: text/html, Size: 8386 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* [musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-22 3:34 ` [musl] " 王志强 @ 2022-09-22 9:10 ` 王志强 2022-09-22 9:39 ` [musl] " 王志强 0 siblings, 1 reply; 68+ messages in thread From: 王志强 @ 2022-09-22 9:10 UTC (permalink / raw) To: musl; +Cc: dalias, Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 5613 bytes --] update: I change the test code to include malloc_usable_size, just to verify the impact of this function call #define MAXF 4096 void* tobefree[MAXF]; int main() { long long i; int v, k, j; size_t s, c=0; char *p; for (i=0; i<100000000L; i++) { v = rand(); s = ((v%126)+1)*1024; p = (char*) malloc(s); for (j=0; j+1<s; j+=1024) p[j+1]=j; ////<<--poke pages s = malloc_usable_size(p); ////<<-----------------add here if (c>=MAXF) { k = v%c; free(tobefree[k]); tobefree[k]=tobefree[--c]; } tobefree[c++]=p; } return 0; } # time ./m.alpine real 7m 19.07s user 4m 53.61s sys 2m 25.22s It took about 440 seconds to finish the whole 100 million iterations (on average 4.4microseconds per malloc&free, while with glibc it took 96 seconds, average 1microsecond) the profiling report is as following _init__libc_start_main(59.374% 202450/340976) main(79.104% 160147/202450) aligned_allocmalloc_usable_size(20.098% 40689/202450) madvise(33.323% 113623/340976) aligned_allocmalloc_usable_size(6.534% 22279/340976) aligned_allocmalloc_usable_size got picked by profiler about (40689+22279)/340976 = 18.4%, should this raise any concern? (Still, madvise calls has major impact, 33.3%) (The profiling report is different from last few because I add code to poke several memory addresses after malloc) Here is the profiler report for glibc, running same code, malloc_usable_size got picked only 1% of total 75568 samples _dl_catch_error??(62.487% 47220/75568) __libc_start_main(100.000% 47220/47220) main(80.064% 37806/47220) cfree(17.274% 8157/47220) malloc_usable_size(1.355% 640/47220) pthread_attr_setschedparam?__libc_malloc(36.795% 27805/75568) My profiler code is here, https://github.com/zq-david-wang/linux-tools/tree/main/perf/profiler, it take a pid as parameter and would profile all the pids within same perf_event cgroup. At 2022-09-22 11:34:59, "王志强" <00107082@163.com> wrote: Hi Rich, Thanks for your time. Totally agreed that in realworld application, it would take way more time to process that huge bulk of memory, compared with the average 3 microsecond per malloc&free. At 2022-09-22 01:58:17, "Rich Felker" <dalias@libc.org> wrote: > > >> Your test case, with the completely random size distribution across >> various large sizes, is likely a worst case. The mean size you're >> allocating is 128k, which is the threshold for direct mmap/munmap of >> each allocation, so at least half of the allocations you're making can >> *never* be reused, and will always be immediately unmapped on free. It >> might be interesting to change the scaling factor from 1k to 256 bytes >> so that basically all of the allocation sizes are in the >> malloc-managed range. > >One observation if this change is made: it looks like at least 70% of >the time is spent performing madvise(MADV_FREE), and that a large >portion of the rest (just looking at strace) seems to be repeatedly >mapping and freeing a 17-page (68k) block, probably because this size >happens to be at the boundary of some threshold where bounce >protection isn't happening. I think we should look at both of these in >more detail, since they both suggest opportunities for large >performance improvements at low cost. > I have made several profiling, the report indeed show that as the size decreased, performance went up significantly and madvise now take major portion of time, as you suggested, also madviser's portion decrease as size decrease, when average size reach to 2K, madviser was only picked by profiler less than 2%: 1. average 64K(1K~128K) malloc/free # time ./m.alpine real 1m 50.12s user 0m 39.80s sys 1m 10.17s madvise(61.945% 52926/85440) __libc_start_main?(22.158% 18932/85440) malloc_usable_size?(82.870% 15689/18932) asm_exc_page_fault(2.766% 434/15689) main(16.781% 3177/18932) asm_exc_page_fault(2.487% 79/3177) malloc_usable_size?(10.969% 9372/85440) asm_exc_page_fault(6.519% 611/9372) munmap(2.449% 2092/85440) exit?(1.540% 1316/85440) 2. average 32K (1K~64K) malloc/free # time ./m.alpine real 1m 12.89s user 0m 30.62s sys 0m 41.91s madvise(60.835% 34282/56352) __libc_start_main?(27.410% 15446/56352) malloc_usable_size?(78.558% 12134/15446) main(20.996% 3243/15446) malloc_usable_size?(9.354% 5271/56352) exit?(1.888% 1064/56352) 3. average 8K (1K~16K) # time ./m.alpine real 0m 42.35s user 0m 22.94s sys 0m 19.27s madvise(49.338% 16169/32772) __libc_start_main?(36.592% 11992/32772) malloc_usable_size?(79.244% 9503/11992) main(20.480% 2456/11992) malloc_usable_size?(10.921% 3579/32772) exit?(2.591% 849/32772) 4. average 4K (1k~8K) # time ./m.debian real 0m32.477s user 0m31.829s sys 0m0.596s __libc_start_main?(44.474% 9279/20864) malloc_usable_size?(81.410% 7554/9279) main(17.987% 1669/9279) madvise(37.720% 7870/20864) malloc_usable_size?(13.986% 2918/20864) exit?(3.350% 699/20864) 5 average 2K(128B~4096B) (madviser only about 1.7%) # time ./m.alpine real 0m 13.02s user 0m 12.68s sys 0m 0.26s __libc_start_main?(69.538% 6974/10029) malloc_usable_size?(80.786% 5634/6974) main(18.569% 1295/6974) malloc_usable_size?(21.538% 2160/10029) exit?(7.060% 708/10029) madvise(1.715% 172/10029) 6. average 1K (128B~2048B) # time ./m.alpine real 0m 10.75s user 0m 10.68s sys 0m 0.01s __libc_start_main?(72.495% 6012/8293) malloc_usable_size?(76.630% 4607/6012) main(22.904% 1377/6012) malloc_usable_size?(18.823% 1561/8293) exit?(8.610% 714/8293) David [-- Attachment #2: Type: text/html, Size: 13041 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* [musl] Re:[musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-22 9:10 ` [musl] " 王志强 @ 2022-09-22 9:39 ` 王志强 0 siblings, 0 replies; 68+ messages in thread From: 王志强 @ 2022-09-22 9:39 UTC (permalink / raw) To: musl; +Cc: dalias, Quentin Rameau, Florian Weimer [-- Attachment #1: Type: text/plain, Size: 6389 bytes --] Sorry, I found I made a wrong reading about profiling in last mail about malloc_usable_size, I think malloc_usable_size is indeed has no impact. The symbol "aligned_allocmalloc_usable_size" should be aligned_alloc<?>malloc_usable_size, means a function address not symbolized, and the address belongs to neither aligned_alloc nor malloc_usable_size, but is between aligned_alloc and malloc_usable_size, not sure what it belongs to (Most likely, it is some static function, somewhere) Compared with old profiling report which did not call malloc_usable_size, there were also significant addresses around that range, so... Please ignore my last mail about malloc_usable_size.....sorry.... At 2022-09-22 17:10:18, "王志强" <00107082@163.com> wrote: update: I change the test code to include malloc_usable_size, just to verify the impact of this function call #define MAXF 4096 void* tobefree[MAXF]; int main() { long long i; int v, k, j; size_t s, c=0; char *p; for (i=0; i<100000000L; i++) { v = rand(); s = ((v%126)+1)*1024; p = (char*) malloc(s); for (j=0; j+1<s; j+=1024) p[j+1]=j; ////<<--poke pages s = malloc_usable_size(p); ////<<-----------------add here if (c>=MAXF) { k = v%c; free(tobefree[k]); tobefree[k]=tobefree[--c]; } tobefree[c++]=p; } return 0; } # time ./m.alpine real 7m 19.07s user 4m 53.61s sys 2m 25.22s It took about 440 seconds to finish the whole 100 million iterations (on average 4.4microseconds per malloc&free, while with glibc it took 96 seconds, average 1microsecond) the profiling report is as following _init__libc_start_main(59.374% 202450/340976) main(79.104% 160147/202450) aligned_allocmalloc_usable_size(20.098% 40689/202450) madvise(33.323% 113623/340976) aligned_allocmalloc_usable_size(6.534% 22279/340976) aligned_allocmalloc_usable_size got picked by profiler about (40689+22279)/340976 = 18.4%, should this raise any concern? (Still, madvise calls has major impact, 33.3%) (The profiling report is different from last few because I add code to poke several memory addresses after malloc) Here is the profiler report for glibc, running same code, malloc_usable_size got picked only 1% of total 75568 samples _dl_catch_error??(62.487% 47220/75568) __libc_start_main(100.000% 47220/47220) main(80.064% 37806/47220) cfree(17.274% 8157/47220) malloc_usable_size(1.355% 640/47220) pthread_attr_setschedparam?__libc_malloc(36.795% 27805/75568) My profiler code is here, https://github.com/zq-david-wang/linux-tools/tree/main/perf/profiler, it take a pid as parameter and would profile all the pids within same perf_event cgroup. At 2022-09-22 11:34:59, "王志强" <00107082@163.com> wrote: Hi Rich, Thanks for your time. Totally agreed that in realworld application, it would take way more time to process that huge bulk of memory, compared with the average 3 microsecond per malloc&free. At 2022-09-22 01:58:17, "Rich Felker" <dalias@libc.org> wrote: > > >> Your test case, with the completely random size distribution across >> various large sizes, is likely a worst case. The mean size you're >> allocating is 128k, which is the threshold for direct mmap/munmap of >> each allocation, so at least half of the allocations you're making can >> *never* be reused, and will always be immediately unmapped on free. It >> might be interesting to change the scaling factor from 1k to 256 bytes >> so that basically all of the allocation sizes are in the >> malloc-managed range. > >One observation if this change is made: it looks like at least 70% of >the time is spent performing madvise(MADV_FREE), and that a large >portion of the rest (just looking at strace) seems to be repeatedly >mapping and freeing a 17-page (68k) block, probably because this size >happens to be at the boundary of some threshold where bounce >protection isn't happening. I think we should look at both of these in >more detail, since they both suggest opportunities for large >performance improvements at low cost. > I have made several profiling, the report indeed show that as the size decreased, performance went up significantly and madvise now take major portion of time, as you suggested, also madviser's portion decrease as size decrease, when average size reach to 2K, madviser was only picked by profiler less than 2%: 1. average 64K(1K~128K) malloc/free # time ./m.alpine real 1m 50.12s user 0m 39.80s sys 1m 10.17s madvise(61.945% 52926/85440) __libc_start_main?(22.158% 18932/85440) malloc_usable_size?(82.870% 15689/18932) asm_exc_page_fault(2.766% 434/15689) main(16.781% 3177/18932) asm_exc_page_fault(2.487% 79/3177) malloc_usable_size?(10.969% 9372/85440) asm_exc_page_fault(6.519% 611/9372) munmap(2.449% 2092/85440) exit?(1.540% 1316/85440) 2. average 32K (1K~64K) malloc/free # time ./m.alpine real 1m 12.89s user 0m 30.62s sys 0m 41.91s madvise(60.835% 34282/56352) __libc_start_main?(27.410% 15446/56352) malloc_usable_size?(78.558% 12134/15446) main(20.996% 3243/15446) malloc_usable_size?(9.354% 5271/56352) exit?(1.888% 1064/56352) 3. average 8K (1K~16K) # time ./m.alpine real 0m 42.35s user 0m 22.94s sys 0m 19.27s madvise(49.338% 16169/32772) __libc_start_main?(36.592% 11992/32772) malloc_usable_size?(79.244% 9503/11992) main(20.480% 2456/11992) malloc_usable_size?(10.921% 3579/32772) exit?(2.591% 849/32772) 4. average 4K (1k~8K) # time ./m.debian real 0m32.477s user 0m31.829s sys 0m0.596s __libc_start_main?(44.474% 9279/20864) malloc_usable_size?(81.410% 7554/9279) main(17.987% 1669/9279) madvise(37.720% 7870/20864) malloc_usable_size?(13.986% 2918/20864) exit?(3.350% 699/20864) 5 average 2K(128B~4096B) (madviser only about 1.7%) # time ./m.alpine real 0m 13.02s user 0m 12.68s sys 0m 0.26s __libc_start_main?(69.538% 6974/10029) malloc_usable_size?(80.786% 5634/6974) main(18.569% 1295/6974) malloc_usable_size?(21.538% 2160/10029) exit?(7.060% 708/10029) madvise(1.715% 172/10029) 6. average 1K (128B~2048B) # time ./m.alpine real 0m 10.75s user 0m 10.68s sys 0m 0.01s __libc_start_main?(72.495% 6012/8293) malloc_usable_size?(76.630% 4607/6012) main(22.904% 1377/6012) malloc_usable_size?(18.823% 1561/8293) exit?(8.610% 714/8293) David [-- Attachment #2: Type: text/html, Size: 14461 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 13:54 ` Siddhesh Poyarekar 2022-09-20 16:59 ` James Y Knight @ 2022-09-20 17:28 ` baiyang 2022-09-20 17:44 ` Siddhesh Poyarekar 2022-10-10 14:13 ` Florian Weimer 2 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-20 17:28 UTC (permalink / raw) To: Siddhesh Poyarekar, Florian Weimer; +Cc: Rich Felker, musl [-- Attachment #1: Type: text/plain, Size: 2624 bytes --] > Is there any other valid reason to use > malloc_usable_size instead of simply using realloc? Yes, we use it as one of the parameters to estimate the memory copy cost when realloc degenerates back to malloc+memcpy+free, refer to the previous mails. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Siddhesh Poyarekar Date: 2022-09-20 21:54 To: Florian Weimer CC: Rich Felker; baiyang; musl Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote: > The compiler needs to treat malloc_usable_size similar to realloc and > just the size information for the buffer based on the return value from > malloc_usable_size. This is admittedly harder to do than a comparable > analysis for realloc if the compiler interprets the standard in such a > way that after a successful realloc, any access to the original pointer > value is undefined. > > malloc_usable_size is not actually *that* useful with allocators that do > not have strict size classes because they do not over-allocate that > much. For these allocators, it may be possible to increase the size of > allocation significantly without moving it, but that is not reflected in > the return value of malloc_usable_size at all. So the glibc manual does not document malloc_usable_size semantics (which is weird since it is, well, a GNU extension!) so the only reference users have is the man page. The man page already discourages use of malloc_usable_size to write beyond the allocated size in the NOTES section: The value returned by malloc_usable_size() may be greater than the requested size of the allocation because of alignment and minimum size constraints. Although the excess bytes can be overwritten by the application without ill effects, this is not good programming practice: the number of excess bytes in an allocation depends on the underlying implementation. Adding support for something that's already declared as bad programming practice seems like a step backwards. Instead, I hope we find a way to discourage active use of malloc_usable_size more strongly. At least based on the systemd experience, the problem they try to solve is that of glibc realloc being too slow for paths where the reallocation should return the same block and that should be easy to special-case. Is there any other valid reason to use malloc_usable_size instead of simply using realloc? Sid [-- Attachment #2: Type: text/html, Size: 5037 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 17:28 ` baiyang @ 2022-09-20 17:44 ` Siddhesh Poyarekar 0 siblings, 0 replies; 68+ messages in thread From: Siddhesh Poyarekar @ 2022-09-20 17:44 UTC (permalink / raw) To: baiyang; +Cc: Florian Weimer, Rich Felker, musl On Tue, Sep 20, 2022 at 1:28 PM baiyang <baiyang@gmail.com> wrote: > > > Is there any other valid reason to use > > malloc_usable_size instead of simply using realloc? > > Yes, we use it as one of the parameters to estimate the memory copy cost when realloc degenerates back to malloc+memcpy+free, refer to the previous mails. I was only pulled into this conversation with the email I responded to; I am not subscribed to the musl mailing list. If this estimation is not for diagnostics then it's really not that different from the case I'm talking about. You shouldn't have to do this check and realloc should do the right thing whenever there's space available in the chunk. I can't comment on musl-specific bits because I haven't taken a close enough look at mallocng to do that. In general though, I don't think it makes sense for a C library to support malloc_usable_size beyond diagnostics. Sid ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-20 13:54 ` Siddhesh Poyarekar 2022-09-20 16:59 ` James Y Knight 2022-09-20 17:28 ` baiyang @ 2022-10-10 14:13 ` Florian Weimer 2 siblings, 0 replies; 68+ messages in thread From: Florian Weimer @ 2022-10-10 14:13 UTC (permalink / raw) To: Siddhesh Poyarekar; +Cc: Rich Felker, baiyang, musl * Siddhesh Poyarekar: > On Tue, Sep 20, 2022 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote: >> The compiler needs to treat malloc_usable_size similar to realloc and >> just the size information for the buffer based on the return value from >> malloc_usable_size. This is admittedly harder to do than a comparable >> analysis for realloc if the compiler interprets the standard in such a >> way that after a successful realloc, any access to the original pointer >> value is undefined. >> >> malloc_usable_size is not actually *that* useful with allocators that do >> not have strict size classes because they do not over-allocate that >> much. For these allocators, it may be possible to increase the size of >> allocation significantly without moving it, but that is not reflected in >> the return value of malloc_usable_size at all. > > So the glibc manual does not document malloc_usable_size semantics > (which is weird since it is, well, a GNU extension!) I think we got it via dlmalloc, which says this: /* malloc_usable_size(void* p); Returns the number of bytes you can actually use in an allocated chunk, which may be more than you requested (although often not) due to alignment and minimum size constraints. You can use this many bytes without worrying about overwriting other allocated objects. This is not a particularly great programming practice. malloc_usable_size can be more useful in debugging and assertions, for example: p = malloc(n); assert(malloc_usable_size(p) >= 256); */ I don't think it's a GNU invention. The GNU malloc used to be Mike Haertel's malloc, as far as I can tell, and that didn't have the function. GNU malloc had a malloc_object_allocation_size function at one point, it seems, but I don't know if that was before it was a rebranded dlmalloc. The malloc subsystem wasn't part of the glibc CVS at the time. > Adding support for something that's already declared as bad > programming practice seems like a step backwards. Instead, I hope we > find a way to discourage active use of malloc_usable_size more > strongly. At least based on the systemd experience, the problem they > try to solve is that of glibc realloc being too slow for paths where > the reallocation should return the same block and that should be easy > to special-case. Is there any other valid reason to use > malloc_usable_size instead of simply using realloc? Do you know which case systemd tries to optimize? Increasing the allocation size or lowering it? Maybe we should define a non-copying version of realloc that if that's what programmers want. I've got a write-up somewhere what an a replacement for class-free allocators of jemalloc's xallocx function would look like. I think it has to involve arithmetic progressions (but no L-series), so it's not pretty. Anyway, if there is no real use case for malloc_usable_size with the current glibc realloc implementation, we should really deprecate it. Especially since C++ is moving into an incompatible direction. Thanks, Florian ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang 2022-09-19 11:08 ` Szabolcs Nagy @ 2022-09-19 13:43 ` Rich Felker 2022-09-19 17:32 ` baiyang 1 sibling, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 13:43 UTC (permalink / raw) To: baiyang; +Cc: musl On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote: > Hi there, > > As we have discussed at > https://github.com/openwrt/openwrt/issues/10752. The > malloc_usable_size() function in musl 1.2 (mallocng) seems to have > some performance issues. > > It caused realloc and free spends too long time for get the chunk size. > > As we mentioned in the discussion, tcmalloc and some other > allocators can also accurately obtain the size class corresponding > to a memory block and its precise size, and it is also very fast at > the same time. > > Can we make some improvements to the existing malloc_usable_size > algorithm in mallocng? This should significantly improve the > performance of existing algorithms. Can you please start from a point of identifying the real-world case in which you're hitting a performance degredation? Made-up tests are generally not helpful and will almost always lead to focusing on the wrong problem. For now I'm going to focus on some things from the linked thread: > > Considering that realloc itself contains a complete > > malloc_usable_size (refer to here and here), So actually most > > (66.7%) of the realloc time is spent doing malloc_usable_size. In your test that increments the realloc size by one each iteration, only one in every PAGESIZE calls has any real work to do. The rest do nothing but set_size after obtaining the metadata on the object they're acting on. It's completely expected that the runtime of these will be dominated by obtaining the metadata; this isn't evidence of anything wrong. And, moreover, it's almost surely a lot more than 66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap syscalls and page faults accessing the new pages they produce. > > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac > > os x (malloc_size), and musl 1.1, even on low-end embedded > > processors, the consumption of malloc_usable_size per 10 million > > calls is mostly not more than a few hundred milliseconds. It looks like mallocng's malloc_usable_size is taking around 150 ns per call on your system, vs maybe 30-50 for others? > > In addition, this very slow slab size acquisition algorithm also > > needs to be called every time free (see here). So we believe it > > should be the main reason for malloc/free and realloc performance > > degradation in version 1.2. Unless you have an application that's explicitly using malloc_usable_size all over the place, it's highly unlikely that this is the cause of your real-world performance problems. The vast majority of reported problems with malloc performance have been in multithreaded applications, where the dominating time cost is fundamental: synchronization cost of having global consistency. There you'll expect to find very similar performance figures from any other allocator with global consistency, such as hardened_malloc. If you're really having single-threaded performance problems that aren't just in made-up benchmarks, please see if you can narrow down the cause empirically rather than speculatively. For example, running the program under perf and looking at where the time is being spent. > > If we can improve its speed and make it close to implementations > > like tcmalloc (tcmalloc can also accurately return the size of the > > size class to which the chunk belongs), it should significantly > > improve the performance of mallocng (at least in single-threaded > > scenarios) . tcmalloc is fast by not having global consistency, not being hardened against memory errors like double-free and use-after-free, and not avoiding fragmentation and excessive memory usage. Likewise for most of the others. The run time costs in mallocng for looking up the out-of-band metadata are largely fundamental to it being out-of-band (not subject to direct falsification via typically exploitable application bugs), size-efficient, 32-bit-compatible, nommu-compatible, etc. Other approaches like in hardened_malloc can be moderately more efficient to access the metadata, at the price of not being at all amenable to small systems, which are a core goal of musl we can't really disregard. I can't say for sure there's not any room for optimization in the metadata fetching though. Looking at the assembly output might be informative, to see if we're doing anything that's making the compiler emit gratuitously inefficient code. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 13:43 ` Rich Felker @ 2022-09-19 17:32 ` baiyang 2022-09-19 18:15 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-19 17:32 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 6175 bytes --] Hi Rich, Thanks for your reply. > Unless you have an application that's explicitly using > malloc_usable_size all over the place, it's highly unlikely that this > is the cause of your real-world performance problems. 1. Yes, we have a real scenario where `malloc_usable_size` is called frequently: we need to optimize the realloc experience. We add an extra parameter to realloc - minimalCopyBytes: it represents the actual size of data that needs to be copied after fallback to malloc-copy-free mode. We will judge whether to call realloc or complete malloc-memcpy-free by ourself based on factors such as the size of the data that realloc needs to copy (obtained through `malloc_usable_size`), the size that we actually need to copy when we doing malloc-memcpy-free ourself (minimalCopyBytes) and the chance of merging chunks (small blocks) or mremap (large blocks) in the underlayer realloc. So, this is a real scenario, we need to call `malloc_usable_size` frequently. 2. As I mentioned before, this isn't just a problem with `malloc_usable_size`, since we actually include a full `malloc_usable_size` procedure in both `realloc` and `free`, it actually slows down The speed of other calls such as `free` and `realloc`. So this problem actually slows down not only the `malloc_usable_size` call itself, but also the realloc and free calls. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-19 21:43 To: baiyang CC: musl Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote: > Hi there, > > As we have discussed at > https://github.com/openwrt/openwrt/issues/10752. The > malloc_usable_size() function in musl 1.2 (mallocng) seems to have > some performance issues. > > It caused realloc and free spends too long time for get the chunk size. > > As we mentioned in the discussion, tcmalloc and some other > allocators can also accurately obtain the size class corresponding > to a memory block and its precise size, and it is also very fast at > the same time. > > Can we make some improvements to the existing malloc_usable_size > algorithm in mallocng? This should significantly improve the > performance of existing algorithms. Can you please start from a point of identifying the real-world case in which you're hitting a performance degredation? Made-up tests are generally not helpful and will almost always lead to focusing on the wrong problem. For now I'm going to focus on some things from the linked thread: > > Considering that realloc itself contains a complete > > malloc_usable_size (refer to here and here), So actually most > > (66.7%) of the realloc time is spent doing malloc_usable_size. In your test that increments the realloc size by one each iteration, only one in every PAGESIZE calls has any real work to do. The rest do nothing but set_size after obtaining the metadata on the object they're acting on. It's completely expected that the runtime of these will be dominated by obtaining the metadata; this isn't evidence of anything wrong. And, moreover, it's almost surely a lot more than 66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap syscalls and page faults accessing the new pages they produce. > > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac > > os x (malloc_size), and musl 1.1, even on low-end embedded > > processors, the consumption of malloc_usable_size per 10 million > > calls is mostly not more than a few hundred milliseconds. It looks like mallocng's malloc_usable_size is taking around 150 ns per call on your system, vs maybe 30-50 for others? > > In addition, this very slow slab size acquisition algorithm also > > needs to be called every time free (see here). So we believe it > > should be the main reason for malloc/free and realloc performance > > degradation in version 1.2. Unless you have an application that's explicitly using malloc_usable_size all over the place, it's highly unlikely that this is the cause of your real-world performance problems. The vast majority of reported problems with malloc performance have been in multithreaded applications, where the dominating time cost is fundamental: synchronization cost of having global consistency. There you'll expect to find very similar performance figures from any other allocator with global consistency, such as hardened_malloc. If you're really having single-threaded performance problems that aren't just in made-up benchmarks, please see if you can narrow down the cause empirically rather than speculatively. For example, running the program under perf and looking at where the time is being spent. > > If we can improve its speed and make it close to implementations > > like tcmalloc (tcmalloc can also accurately return the size of the > > size class to which the chunk belongs), it should significantly > > improve the performance of mallocng (at least in single-threaded > > scenarios) . tcmalloc is fast by not having global consistency, not being hardened against memory errors like double-free and use-after-free, and not avoiding fragmentation and excessive memory usage. Likewise for most of the others. The run time costs in mallocng for looking up the out-of-band metadata are largely fundamental to it being out-of-band (not subject to direct falsification via typically exploitable application bugs), size-efficient, 32-bit-compatible, nommu-compatible, etc. Other approaches like in hardened_malloc can be moderately more efficient to access the metadata, at the price of not being at all amenable to small systems, which are a core goal of musl we can't really disregard. I can't say for sure there's not any room for optimization in the metadata fetching though. Looking at the assembly output might be informative, to see if we're doing anything that's making the compiler emit gratuitously inefficient code. Rich [-- Attachment #2: Type: text/html, Size: 10894 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 17:32 ` baiyang @ 2022-09-19 18:15 ` Rich Felker 2022-09-19 18:44 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 18:15 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 01:32:31AM +0800, baiyang wrote: > Hi Rich, > > Thanks for your reply. > > > Unless you have an application that's explicitly using > > malloc_usable_size all over the place, it's highly unlikely that this > > is the cause of your real-world performance problems. > > 1. Yes, we have a real scenario where `malloc_usable_size` is called > frequently: we need to optimize the realloc experience. We add an > extra parameter to realloc - minimalCopyBytes: it represents the > actual size of data that needs to be copied after fallback to > malloc-copy-free mode. We will judge whether to call realloc or > complete malloc-memcpy-free by ourself based on factors such as the > size of the data that realloc needs to copy (obtained through > `malloc_usable_size`), the size that we actually need to copy when > we doing malloc-memcpy-free ourself (minimalCopyBytes) and the > chance of merging chunks (small blocks) or mremap (large blocks) in > the underlayer realloc. So, this is a real scenario, we need to call > `malloc_usable_size` frequently. Is there a reason you're relying on an unreliable and nonstandard function (malloc_usable_size) to do this rather than your program keeping track of its own knowledge of the allocated size? This is what the C language expects you to do. For example if you have a structure that contains a pointer to a dynamically sized buffer, normally you store the size in a size_t member right next to that pointer, allowing you to make these kind of decisions without having to probe anything. > 2. As I mentioned before, this isn't just a problem with > `malloc_usable_size`, since we actually include a full > `malloc_usable_size` procedure in both `realloc` and `free`, it > actually slows down The speed of other calls such as `free` and > `realloc`. So this problem actually slows down not only the > `malloc_usable_size` call itself, but also the realloc and free > calls. If this is affecting you too, that's a separate issue. But I can't tell from what you've reported so far whether you're just claiming this on a theoretical basis or whether you're actually experiencing unacceptable performance. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 18:15 ` Rich Felker @ 2022-09-19 18:44 ` baiyang 2022-09-19 19:18 ` Rich Felker 0 siblings, 1 reply; 68+ messages in thread From: baiyang @ 2022-09-19 18:44 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 4190 bytes --] > Is there a reason you're relying on an unreliable and nonstandard > function (malloc_usable_size) to do this rather than your program > keeping track of its own knowledge of the allocated size? This is what > the C language expects you to do. For example if you have a structure > that contains a pointer to a dynamically sized buffer, normally you > store the size in a size_t member right next to that pointer, allowing > you to make these kind of decisions without having to probe anything. Yes, as I have been said, by comparing the number of bytes that realloc needs to copy in the worst case (the return value of malloc_usable_size), and the number of bytes we actually need to copy, we can optimize the performance of realloc in real scenarios and avoid unnecessary memory copies. In fact, in scenarios including glibc, tcmalloc, windows crt, mac os x, uclibc and musl 1.1, we did achieve good optimization results. On the other hand, of course we keep the number of bytes actually allocated, but it doesn't really reflect objectively the number of bytes to be copied by realloc when the memcpy actually occurs. And malloc_usable_size() more accurately reflects how many bytes realloc needs to copy when it degenerates back to malloc-memcpy-free mode. So our expectation is as mentioned in the man page for linux, mac os or windows: "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation" or "The memory block size is always at least as large as the allocation it backs, **and may be larger**." - We expect to get its internal size to evaluate the cost of memory copying. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 02:15 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 01:32:31AM +0800, baiyang wrote: > Hi Rich, > > Thanks for your reply. > > > Unless you have an application that's explicitly using > > malloc_usable_size all over the place, it's highly unlikely that this > > is the cause of your real-world performance problems. > > 1. Yes, we have a real scenario where `malloc_usable_size` is called > frequently: we need to optimize the realloc experience. We add an > extra parameter to realloc - minimalCopyBytes: it represents the > actual size of data that needs to be copied after fallback to > malloc-copy-free mode. We will judge whether to call realloc or > complete malloc-memcpy-free by ourself based on factors such as the > size of the data that realloc needs to copy (obtained through > `malloc_usable_size`), the size that we actually need to copy when > we doing malloc-memcpy-free ourself (minimalCopyBytes) and the > chance of merging chunks (small blocks) or mremap (large blocks) in > the underlayer realloc. So, this is a real scenario, we need to call > `malloc_usable_size` frequently. Is there a reason you're relying on an unreliable and nonstandard function (malloc_usable_size) to do this rather than your program keeping track of its own knowledge of the allocated size? This is what the C language expects you to do. For example if you have a structure that contains a pointer to a dynamically sized buffer, normally you store the size in a size_t member right next to that pointer, allowing you to make these kind of decisions without having to probe anything. > 2. As I mentioned before, this isn't just a problem with > `malloc_usable_size`, since we actually include a full > `malloc_usable_size` procedure in both `realloc` and `free`, it > actually slows down The speed of other calls such as `free` and > `realloc`. So this problem actually slows down not only the > `malloc_usable_size` call itself, but also the realloc and free > calls. If this is affecting you too, that's a separate issue. But I can't tell from what you've reported so far whether you're just claiming this on a theoretical basis or whether you're actually experiencing unacceptable performance. [-- Attachment #2: Type: text/html, Size: 8441 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 18:44 ` baiyang @ 2022-09-19 19:18 ` Rich Felker 2022-09-19 19:45 ` baiyang 0 siblings, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 19:18 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 02:44:58AM +0800, baiyang wrote: > > Is there a reason you're relying on an unreliable and nonstandard > > function (malloc_usable_size) to do this rather than your program > > keeping track of its own knowledge of the allocated size? This is what > > the C language expects you to do. For example if you have a structure > > that contains a pointer to a dynamically sized buffer, normally you > > store the size in a size_t member right next to that pointer, allowing > > you to make these kind of decisions without having to probe anything. > > Yes, as I have been said, by comparing the number of bytes that > realloc needs to copy in the worst case (the return value of > malloc_usable_size), and the number of bytes we actually need to > copy, we can optimize the performance of realloc in real scenarios > and avoid unnecessary memory copies. You can do exactly the same keeping track of the size yourself. The only correct value malloc_usable_size can return is the value you passed to the allocator. If it's returning a different size, your malloc implementation has a problem that will make you commit UB when you use the result of malloc_usable_size. Many real-world ones do have this problem. > In fact, in scenarios including glibc, tcmalloc, windows crt, mac os > x, uclibc and musl 1.1, we did achieve good optimization results. > > On the other hand, of course we keep the number of bytes actually > allocated, but it doesn't really reflect objectively the number of > bytes to be copied by realloc when the memcpy actually occurs. And > malloc_usable_size() more accurately reflects how many bytes realloc > needs to copy when it degenerates back to malloc-memcpy-free mode. I don't understand your claim here. The size you would store is exactly the size that realloc would have to copy. It if copies more, that's just realloc being inefficient, but the difference is not going to be material anyway. > So our expectation is as mentioned in the man page for linux, mac os > or windows: "The value returned by malloc_usable_size() may be > **greater than** the requested size of the allocation" or "The > memory block size is always at least as large as the allocation it > backs, **and may be larger**." - We expect to get its internal size > to evaluate the cost of memory copying. It's sounding more and more like you did premature optimization without measuring any of this, since there is *no way* the possible amount of excess copying a realloc implementation might make internally could cost more than an extra external function call to malloc_usable_size (even if it did nothing but return). Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 19:18 ` Rich Felker @ 2022-09-19 19:45 ` baiyang 2022-09-19 20:07 ` Rich Felker 2022-09-19 20:17 ` Joakim Sindholt 0 siblings, 2 replies; 68+ messages in thread From: baiyang @ 2022-09-19 19:45 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 6207 bytes --] > The only correct value malloc_usable_size can return is the value you passed to the allocator. I don't think so, see: Linux man page: https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation". Mac OS X man page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html - "The memory block size is always at least as large as the allocation it backs, **and may be larger**." FreeBSD man page: https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html - "The return value **may be larger** than the size that was requested during allocation". These official man pages clearly state that the return value of malloc_usable_size is the size of the memory block allocated internally, not the size submitted by the user. Instead, we didn't find any documentation saying that the return value of malloc_usable_size must be the size submitted by the user to be correct. Please correct me if you have the relevant documentation. > It's sounding more and more like you did premature optimization > without measuring any of this, since there is *no way* the possible > amount of excess copying a realloc implementation might make > internally could cost more than an extra external function call to > malloc_usable_size (even if it did nothing but return). As I said before: > We have a real scenario where `malloc_usable_size` is called frequently: we need to optimize the realloc experience. We add an extra parameter to realloc - minimalCopyBytes: it represents the actual size of data that needs to be copied after fallback to malloc-copy-free mode. We will judge whether to call realloc or complete malloc-memcpy-free by ourself based on factors such as the size of the data that realloc needs to copy (obtained through `malloc_usable_size`), the size that we actually need to copy when we doing malloc-memcpy-free ourself (minimalCopyBytes) and the chance of merging chunks (small blocks) or mremap (large blocks) in the underlayer realloc. So, this is a real scenario, we need to call `malloc_usable_size` frequently. Example: We allocate a block of 500KB (malloc actually allocated 512KB) and want to extend it to 576KB via realloc. At this point realloc may downgrade back to the inefficient malloc(756KB), memcpy(512KB) and free(512KB) modes. But the real situation at this time may be that we only need to keep the first 4KB of content in 500KB, so we comprehensively evaluate the cost (including the possibility of realloc using block merging like in musl 1.1, and techniques like mremap to avoid copying) to decide whether to complete malloc(576KB), memcpy(**4KB**), free(512KB) by ourselves are more cost-effective. Such optimizations have measurable and significant effects on our practical applications in each of the above libc environments. In this scenario, we need to get the 512KB actually allocated by malloc through malloc_usable_size instead of the 500KB length we saved ourselves. Thanks :-) -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 03:18 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 02:44:58AM +0800, baiyang wrote: > > Is there a reason you're relying on an unreliable and nonstandard > > function (malloc_usable_size) to do this rather than your program > > keeping track of its own knowledge of the allocated size? This is what > > the C language expects you to do. For example if you have a structure > > that contains a pointer to a dynamically sized buffer, normally you > > store the size in a size_t member right next to that pointer, allowing > > you to make these kind of decisions without having to probe anything. > > Yes, as I have been said, by comparing the number of bytes that > realloc needs to copy in the worst case (the return value of > malloc_usable_size), and the number of bytes we actually need to > copy, we can optimize the performance of realloc in real scenarios > and avoid unnecessary memory copies. You can do exactly the same keeping track of the size yourself. The only correct value malloc_usable_size can return is the value you passed to the allocator. If it's returning a different size, your malloc implementation has a problem that will make you commit UB when you use the result of malloc_usable_size. Many real-world ones do have this problem. > In fact, in scenarios including glibc, tcmalloc, windows crt, mac os > x, uclibc and musl 1.1, we did achieve good optimization results. > > On the other hand, of course we keep the number of bytes actually > allocated, but it doesn't really reflect objectively the number of > bytes to be copied by realloc when the memcpy actually occurs. And > malloc_usable_size() more accurately reflects how many bytes realloc > needs to copy when it degenerates back to malloc-memcpy-free mode. I don't understand your claim here. The size you would store is exactly the size that realloc would have to copy. It if copies more, that's just realloc being inefficient, but the difference is not going to be material anyway. > So our expectation is as mentioned in the man page for linux, mac os > or windows: "The value returned by malloc_usable_size() may be > **greater than** the requested size of the allocation" or "The > memory block size is always at least as large as the allocation it > backs, **and may be larger**." - We expect to get its internal size > to evaluate the cost of memory copying. It's sounding more and more like you did premature optimization without measuring any of this, since there is *no way* the possible amount of excess copying a realloc implementation might make internally could cost more than an extra external function call to malloc_usable_size (even if it did nothing but return). Rich [-- Attachment #2: Type: text/html, Size: 14282 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 19:45 ` baiyang @ 2022-09-19 20:07 ` Rich Felker 2022-09-19 20:17 ` baiyang 2022-09-19 20:17 ` Joakim Sindholt 1 sibling, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 20:07 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 03:45:35AM +0800, baiyang wrote: > > The only correct value malloc_usable_size can return is the value > > you passed to the allocator. > > I don't think so, see: > > Linux man page: > https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - > "The value returned by malloc_usable_size() may be **greater than** > the requested size of the allocation". > > Mac OS X man page: > https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html > - "The memory block size is always at least as large as the > allocation it backs, **and may be larger**." > > FreeBSD man page: > https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html > - "The return value **may be larger** than the size that was > requested during allocation". > > These official man pages clearly state that the return value of > malloc_usable_size is the size of the memory block allocated > internally, not the size submitted by the user. > > Instead, we didn't find any documentation saying that the return > value of malloc_usable_size must be the size submitted by the user > to be correct. Please correct me if you have the relevant > documentation. OK, I didn't state that precisely. There are two conflicting claims for what the malloc_usable_size contract is. If it's allowed to return some value larger than the size you requested, then the size returned is not actually "usable" and there's basically nothing useful you can do with the function. > > It's sounding more and more like you did premature optimization > > without measuring any of this, since there is *no way* the possible > > amount of excess copying a realloc implementation might make > > internally could cost more than an extra external function call to > > malloc_usable_size (even if it did nothing but return). > > As I said before: > > We have a real scenario where `malloc_usable_size` is called > > frequently: we need to optimize the realloc experience. We add an > > extra parameter to realloc - minimalCopyBytes: it represents the > > actual size of data that needs to be copied after fallback to > > malloc-copy-free mode. We will judge whether to call realloc or > > complete malloc-memcpy-free by ourself based on factors such as > > the size of the data that realloc needs to copy (obtained through > > `malloc_usable_size`), the size that we actually need to copy when > > we doing malloc-memcpy-free ourself (minimalCopyBytes) and the > > chance of merging chunks (small blocks) or mremap (large blocks) > > in the underlayer realloc. So, this is a real scenario, we need to > > call `malloc_usable_size` frequently. > > Example: We allocate a block of 500KB (malloc actually allocated > 512KB) and want to extend it to 576KB via realloc. At this point > realloc may downgrade back to the inefficient malloc(756KB), > memcpy(512KB) and free(512KB) modes. Clearly you did not measure this, because with basically any real-world malloc, it will call mremap and move the memory via MMU-level remapping, with no copying involved whatsoever. > But the real situation at this > time may be that we only need to keep the first 4KB of content in > 500KB, so we comprehensively evaluate the cost (including the > possibility of realloc using block merging like in musl 1.1, and > techniques like mremap to avoid copying) to decide whether to > complete malloc(576KB), memcpy(**4KB**), free(512KB) by ourselves > are more cost-effective. You could have achieved exactly the same thing by keeping your own knowledge that you allocated 500kB. But it would still be significantly slower, because mmap+memcpy+munmap (2 syscalls) is slower than mremap (1 syscall). > Such optimizations have measurable and significant effects on our > practical applications in each of the above libc environments. > > In this scenario, we need to get the 512KB actually allocated by > malloc through malloc_usable_size instead of the 500KB length we > saved ourselves. No you don't. Either number works just as well (or rather just as poorly). Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:07 ` Rich Felker @ 2022-09-19 20:17 ` baiyang 2022-09-19 20:28 ` Rich Felker 2022-09-19 22:02 ` Quentin Rameau 0 siblings, 2 replies; 68+ messages in thread From: baiyang @ 2022-09-19 20:17 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 5609 bytes --] > Clearly you did not measure this, because with basically any real-world malloc, it will call mremap and move the memory via MMU-level remapping, with no copying involved whatsoever. OK, I've been said multiple times: > ...and the chance of merging chunks (small blocks) or **mremap** (large blocks) in the underlayer realloc. > ...evaluate the cost (including the possibility of realloc using block merging like in musl 1.1, and techniques like **mremap** to avoid copying) Therefore, we determined that the possibility of each memory allocator calling mremap in different situations was specifically considered on the LINUX platform. And I mentioned it before: we did massively optimize performance in real-world applications. These are not the focus of our discussion. So now it is certain that in musl mallocng: 1. malloc_usable_size will always just return the size submitted by the user to malloc, not the actual size allocated inside it, right? 2. We have no plans to improve malloc_usable_size performance yet, right? thanks -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 04:07 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 03:45:35AM +0800, baiyang wrote: > > The only correct value malloc_usable_size can return is the value > > you passed to the allocator. > > I don't think so, see: > > Linux man page: > https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - > "The value returned by malloc_usable_size() may be **greater than** > the requested size of the allocation". > > Mac OS X man page: > https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html > - "The memory block size is always at least as large as the > allocation it backs, **and may be larger**." > > FreeBSD man page: > https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html > - "The return value **may be larger** than the size that was > requested during allocation". > > These official man pages clearly state that the return value of > malloc_usable_size is the size of the memory block allocated > internally, not the size submitted by the user. > > Instead, we didn't find any documentation saying that the return > value of malloc_usable_size must be the size submitted by the user > to be correct. Please correct me if you have the relevant > documentation. OK, I didn't state that precisely. There are two conflicting claims for what the malloc_usable_size contract is. If it's allowed to return some value larger than the size you requested, then the size returned is not actually "usable" and there's basically nothing useful you can do with the function. > > It's sounding more and more like you did premature optimization > > without measuring any of this, since there is *no way* the possible > > amount of excess copying a realloc implementation might make > > internally could cost more than an extra external function call to > > malloc_usable_size (even if it did nothing but return). > > As I said before: > > We have a real scenario where `malloc_usable_size` is called > > frequently: we need to optimize the realloc experience. We add an > > extra parameter to realloc - minimalCopyBytes: it represents the > > actual size of data that needs to be copied after fallback to > > malloc-copy-free mode. We will judge whether to call realloc or > > complete malloc-memcpy-free by ourself based on factors such as > > the size of the data that realloc needs to copy (obtained through > > `malloc_usable_size`), the size that we actually need to copy when > > we doing malloc-memcpy-free ourself (minimalCopyBytes) and the > > chance of merging chunks (small blocks) or mremap (large blocks) > > in the underlayer realloc. So, this is a real scenario, we need to > > call `malloc_usable_size` frequently. > > Example: We allocate a block of 500KB (malloc actually allocated > 512KB) and want to extend it to 576KB via realloc. At this point > realloc may downgrade back to the inefficient malloc(756KB), > memcpy(512KB) and free(512KB) modes. Clearly you did not measure this, because with basically any real-world malloc, it will call mremap and move the memory via MMU-level remapping, with no copying involved whatsoever. > But the real situation at this > time may be that we only need to keep the first 4KB of content in > 500KB, so we comprehensively evaluate the cost (including the > possibility of realloc using block merging like in musl 1.1, and > techniques like mremap to avoid copying) to decide whether to > complete malloc(576KB), memcpy(**4KB**), free(512KB) by ourselves > are more cost-effective. You could have achieved exactly the same thing by keeping your own knowledge that you allocated 500kB. But it would still be significantly slower, because mmap+memcpy+munmap (2 syscalls) is slower than mremap (1 syscall). > Such optimizations have measurable and significant effects on our > practical applications in each of the above libc environments. > > In this scenario, we need to get the 512KB actually allocated by > malloc through malloc_usable_size instead of the 500KB length we > saved ourselves. No you don't. Either number works just as well (or rather just as poorly). Rich [-- Attachment #2: Type: text/html, Size: 10503 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:17 ` baiyang @ 2022-09-19 20:28 ` Rich Felker 2022-09-19 20:38 ` baiyang 2022-09-19 22:02 ` Quentin Rameau 1 sibling, 1 reply; 68+ messages in thread From: Rich Felker @ 2022-09-19 20:28 UTC (permalink / raw) To: baiyang; +Cc: musl On Tue, Sep 20, 2022 at 04:17:20AM +0800, baiyang wrote: > > Clearly you did not measure this, because with basically any > > real-world malloc, it will call mremap and move the memory via > > MMU-level remapping, with no copying involved whatsoever. > > OK, I've been said multiple times: > > ...and the chance of merging chunks (small blocks) or **mremap** > > (large blocks) in the underlayer realloc. To my knowledge there is no allocator which does chunk merging on 500k chunks in the default configuration. > > ...evaluate the cost (including the possibility of realloc using > > block merging like in musl 1.1, and techniques like **mremap** to > > avoid copying) musl 1.1.x never did block merging on objects of this size. > Therefore, we determined that the possibility of each memory > allocator calling mremap in different situations was specifically > considered on the LINUX platform. > > And I mentioned it before: we did massively optimize performance in > real-world applications. These are not the focus of our discussion. > > So now it is certain that in musl mallocng: > 1. malloc_usable_size will always just return the size submitted by > the user to malloc, not the actual size allocated inside it, right? Yes. > 2. We have no plans to improve malloc_usable_size performance yet, right? In the absence of a concrete, quantitative report of how it's impacting performance, definitely no. Even if there is such a report, if the source of the problem is that you're gratuitously trying to second-guess realloc and making your program slower, rather than just calling realloc, it's not a very interesting report. malloc_usable_size is *not* a function whose use is recommended and it's only there at all because we can't remove an interface once adding it. If the problem is actually the size determination in realloc and free, and if you've *measured* that and can make a quantitative report on how it affected *real world usage* not a made-up benchmark, then that is potentially actionable. Rich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:28 ` Rich Felker @ 2022-09-19 20:38 ` baiyang 0 siblings, 0 replies; 68+ messages in thread From: baiyang @ 2022-09-19 20:38 UTC (permalink / raw) To: Rich Felker; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 2964 bytes --] > To my knowledge there is no allocator which does chunk merging on 500k chunks in the default configuration. It can be ANY size, "500KB" is just an example. > musl 1.1.x never did block merging on objects of this size. Yes, we known it, and we will evaluate according to the specific situation of the operating system, libc implementation and allocation size. Again, this is not the point of this discussion. > In the absence of a concrete, quantitative report of how it's impacting performance, definitely no. OK, I understood and thanks for your time. -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-20 04:28 To: baiyang CC: musl Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 04:17:20AM +0800, baiyang wrote: > > Clearly you did not measure this, because with basically any > > real-world malloc, it will call mremap and move the memory via > > MMU-level remapping, with no copying involved whatsoever. > > OK, I've been said multiple times: > > ...and the chance of merging chunks (small blocks) or **mremap** > > (large blocks) in the underlayer realloc. To my knowledge there is no allocator which does chunk merging on 500k chunks in the default configuration. > > ...evaluate the cost (including the possibility of realloc using > > block merging like in musl 1.1, and techniques like **mremap** to > > avoid copying) musl 1.1.x never did block merging on objects of this size. > Therefore, we determined that the possibility of each memory > allocator calling mremap in different situations was specifically > considered on the LINUX platform. > > And I mentioned it before: we did massively optimize performance in > real-world applications. These are not the focus of our discussion. > > So now it is certain that in musl mallocng: > 1. malloc_usable_size will always just return the size submitted by > the user to malloc, not the actual size allocated inside it, right? Yes. > 2. We have no plans to improve malloc_usable_size performance yet, right? In the absence of a concrete, quantitative report of how it's impacting performance, definitely no. Even if there is such a report, if the source of the problem is that you're gratuitously trying to second-guess realloc and making your program slower, rather than just calling realloc, it's not a very interesting report. malloc_usable_size is *not* a function whose use is recommended and it's only there at all because we can't remove an interface once adding it. If the problem is actually the size determination in realloc and free, and if you've *measured* that and can make a quantitative report on how it affected *real world usage* not a made-up benchmark, then that is potentially actionable. Rich [-- Attachment #2: Type: text/html, Size: 6237 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:17 ` baiyang 2022-09-19 20:28 ` Rich Felker @ 2022-09-19 22:02 ` Quentin Rameau 1 sibling, 0 replies; 68+ messages in thread From: Quentin Rameau @ 2022-09-19 22:02 UTC (permalink / raw) To: musl Hi BaiYang, > And I mentioned it before: we did massively optimize performance in real-world applications. These are not the focus of our discussion. And now you're hitting a problem because those micro-non-standard optimizations that worked in some setup can't be expected to work in a standard way. This is something not surprising and quite usual with that kind of or programming orientation. You may know that musl paradigms are quite opposite to that, one being focused on correctness rather than being bug-compatible with other implementations. As Rich said, there always might be place for improvement in code, but that should be proven to work in the majority of cases, and not at the expanse of stability or correctness. I assume that if you can come up with such an improvement complying with those requirements, they would be honestly considered, but this present problem report have been answered several times and it seems that the solution to your application problems resides in your application optimizations optimization. Cheers! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 19:45 ` baiyang 2022-09-19 20:07 ` Rich Felker @ 2022-09-19 20:17 ` Joakim Sindholt 2022-09-19 20:33 ` baiyang 1 sibling, 1 reply; 68+ messages in thread From: Joakim Sindholt @ 2022-09-19 20:17 UTC (permalink / raw) To: musl On Tue, 20 Sep 2022 03:45:35 +0800, baiyang <baiyang@gmail.com> wrote: > > The only correct value malloc_usable_size can return is the value you passed to the allocator. > > I don't think so, see: > > Linux man page: https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation". > > Mac OS X man page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html - "The memory block size is always at least as large as the allocation it backs, **and may be larger**." > > FreeBSD man page: https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html - "The return value **may be larger** than the size that was requested during allocation". > > These official man pages clearly state that the return value of malloc_usable_size is the size of the memory block allocated internally, not the size submitted by the user. > > Instead, we didn't find any documentation saying that the return value of malloc_usable_size must be the size submitted by the user to be correct. Please correct me if you have the relevant documentation. It's not that malloc_usable_size must return the size originally submitted by the user but that if it doesn't and you take that as an invitation to exceed the original size allocated you will hit UB. Simple case: https://gcc.godbolt.org/z/5E65rr95W Real world example: https://github.com/systemd/systemd/issues/22801 And the reason why is pretty simple: http://port70.net/~nsz/c/c11/n1570.html#7.22.3.4p2 > The malloc function allocates space for an object whose size is > specified by size and whose value is indeterminate. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) 2022-09-19 20:17 ` Joakim Sindholt @ 2022-09-19 20:33 ` baiyang 0 siblings, 0 replies; 68+ messages in thread From: baiyang @ 2022-09-19 20:33 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2436 bytes --] > you will hit UB Thanks for the information, but: 1. As we have discussed in other emails, we do not use malloc_usage_size as such. 2. This is most likely a problem with gcc's corresponding checking mechanism, rather than using glibc's malloc_usable_size() in this way, see: https://gcc.godbolt.org/z/qhqheTqcz -- Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn **** < END OF EMAIL > **** From: Joakim Sindholt Date: 2022-09-20 04:17 To: musl Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, 20 Sep 2022 03:45:35 +0800, baiyang <baiyang@gmail.com> wrote: > > The only correct value malloc_usable_size can return is the value you passed to the allocator. > > I don't think so, see: > > Linux man page: https://man7.org/linux/man-pages/man3/malloc_usable_size.3.html - "The value returned by malloc_usable_size() may be **greater than** the requested size of the allocation". > > Mac OS X man page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html - "The memory block size is always at least as large as the allocation it backs, **and may be larger**." > > FreeBSD man page: https://www.freebsd.org/cgi/man.cgi?query=malloc_usable_size&apropos=0&sektion=0&manpath=FreeBSD+7.1-RELEASE&format=html - "The return value **may be larger** than the size that was requested during allocation". > > These official man pages clearly state that the return value of malloc_usable_size is the size of the memory block allocated internally, not the size submitted by the user. > > Instead, we didn't find any documentation saying that the return value of malloc_usable_size must be the size submitted by the user to be correct. Please correct me if you have the relevant documentation. It's not that malloc_usable_size must return the size originally submitted by the user but that if it doesn't and you take that as an invitation to exceed the original size allocated you will hit UB. Simple case: https://gcc.godbolt.org/z/5E65rr95W Real world example: https://github.com/systemd/systemd/issues/22801 And the reason why is pretty simple: http://port70.net/~nsz/c/c11/n1570.html#7.22.3.4p2 > The malloc function allocates space for an object whose size is > specified by size and whose value is indeterminate. [-- Attachment #2: Type: text/html, Size: 4786 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
end of thread, other threads:[~2022-10-10 14:13 UTC | newest] Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-09-19 7:53 [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) baiyang 2022-09-19 11:08 ` Szabolcs Nagy 2022-09-19 12:36 ` Florian Weimer 2022-09-19 13:46 ` Rich Felker 2022-09-19 13:53 ` James Y Knight 2022-09-19 17:40 ` baiyang 2022-09-19 18:14 ` Szabolcs Nagy 2022-09-19 18:40 ` baiyang 2022-09-19 19:07 ` Gabriel Ravier 2022-09-19 19:21 ` Rich Felker 2022-09-19 21:02 ` Gabriel Ravier 2022-09-19 21:47 ` Rich Felker 2022-09-19 22:31 ` Gabriel Ravier 2022-09-19 22:46 ` baiyang 2022-09-19 20:46 ` Nat! 2022-09-20 8:51 ` Szabolcs Nagy 2022-09-20 0:13 ` James Y Knight 2022-09-20 0:25 ` baiyang 2022-09-20 0:38 ` Rich Felker 2022-09-20 0:47 ` baiyang 2022-09-20 1:00 ` Rich Felker 2022-09-20 1:18 ` baiyang 2022-09-20 2:15 ` Rich Felker 2022-09-20 2:35 ` baiyang 2022-09-20 3:28 ` Rich Felker 2022-09-20 3:53 ` baiyang 2022-09-20 5:41 ` Rich Felker 2022-09-20 5:56 ` baiyang 2022-09-20 12:16 ` Rich Felker 2022-09-20 17:21 ` baiyang 2022-09-20 8:33 ` Florian Weimer 2022-09-20 13:54 ` Siddhesh Poyarekar 2022-09-20 16:59 ` James Y Knight 2022-09-20 17:34 ` Szabolcs Nagy 2022-09-20 19:53 ` James Y Knight 2022-09-24 8:55 ` Fangrui Song 2022-09-20 17:39 ` baiyang 2022-09-20 18:12 ` Quentin Rameau 2022-09-20 18:19 ` Rich Felker 2022-09-20 18:26 ` Alexander Monakov 2022-09-20 18:35 ` baiyang 2022-09-20 20:33 ` Gabriel Ravier 2022-09-20 20:45 ` baiyang 2022-09-21 8:42 ` NRK 2022-09-20 18:37 ` Quentin Rameau 2022-09-21 10:15 ` [musl] " 王志强 2022-09-21 16:11 ` [musl] " 王志强 2022-09-21 17:15 ` [musl] " Rich Felker 2022-09-21 17:58 ` Rich Felker 2022-09-22 3:34 ` [musl] " 王志强 2022-09-22 9:10 ` [musl] " 王志强 2022-09-22 9:39 ` [musl] " 王志强 2022-09-20 17:28 ` baiyang 2022-09-20 17:44 ` Siddhesh Poyarekar 2022-10-10 14:13 ` Florian Weimer 2022-09-19 13:43 ` Rich Felker 2022-09-19 17:32 ` baiyang 2022-09-19 18:15 ` Rich Felker 2022-09-19 18:44 ` baiyang 2022-09-19 19:18 ` Rich Felker 2022-09-19 19:45 ` baiyang 2022-09-19 20:07 ` Rich Felker 2022-09-19 20:17 ` baiyang 2022-09-19 20:28 ` Rich Felker 2022-09-19 20:38 ` baiyang 2022-09-19 22:02 ` Quentin Rameau 2022-09-19 20:17 ` Joakim Sindholt 2022-09-19 20:33 ` baiyang
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).