Best Regards
BaiYang
baiyang@gmail.com
http://i.baiy.cn
From: Rich Felker
Date: 2022-09-20 13:41
To: baiyang
CC: musl
Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)

On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote:
> > The ones that return some value larger than the requested size are
> > returning "the requested size, rounded up to a multiple of 16" or
> > similar. Not "the requested size plus 1500 bytes".
> ...
> > They don't return 8100. They return something like 6608 or 6624.
>
> No, AFAIK, there are many allocators whose return value of
> malloc_usable_size is 1KB (or more) larger than the requested value
> at malloc time.
> For example: if you do "void* p = malloc(6700)" on tcmalloc, then
> "malloc_usable_size(p)" will return **8192**. Far more than just
> "rounded up to a multiple of 16".

OK, thanks for checking and correcting.

> > This does not follow at all. tcmalloc is fast because it does not have
> > global consistency, does not have any notable hardening, and (see the
> > name) keeps large numbers of freed slots *cached* to reuse, thereby
> > using lots of extra memory. Its malloc_usable_size is not fast because
> > of returning the wrong value, if it even does return the wrong value
> > (I have no idea).
>
> We don't need to refer to these features of tcmalloc, we only need
> to refer to its malloc_usable_size algorithm.

Those (mis)features are what provide a fast path here, regardless of
whether you care about them.

> > It's fast because they store the size in-band right
> > next to the allocated memory and trust that it's valid, rather than
> > computing it from out-of-band metadata that is not subject to
> > falsification unless the attacker already has nearly full control of
> > execution.
>
> No, if I understand correctly, tcmalloc doesn't store the size
> in-band right next to the allocated memory.
> On the contrary, when
> executing malloc_usable_size(p) (actually GetSize(p)), it will first
> find the size class corresponding to p through a quick lookup table,
> and then return the length of the size class. See:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

OK, I was confusing tcmalloc with the more conventional "thread-local
freelist caching on top of dlmalloc type base" allocator strategy.
tcmalloc, however, is indeed one of the gigantic ones.

> My understanding: the biggest impediment to our inability to apply
> similar optimizations is that we have to return 6700, not 8192 (of
> course, you've denied this is the reason).

Your understanding is wrong. I've told you how you can measure that
it's wrong. You insist on being stuck on it for no good reason.

If you want to understand *why* tcmalloc is different, start with the
comments at the top of the file you linked:

> // 4. The pagemap (which maps from page-number to descriptor),
> //    can be read without holding any locks, and written while holding
> //    the "pageheap_lock".
> //
> // This multi-threaded access to the pagemap is safe for fairly
> // subtle reasons. We basically assume that when an object X is
> // allocated by thread A and deallocated by thread B, there must
> // have been appropriate synchronization in the handoff of object
> // X from thread A to thread B.

This is the kind of thing I mean by lack of global consistency (no
synchronization around access to these data structures) and lack of
any meaningful hardening (*assuming* no memory lifetime usage errors
in the calling application).

The GetSize function you cited uses this global pagemap to go straight
from a page address to a sizeclass, via what amounts to a two-level or
three-level table indexed by upper bits of the address (the comment
says the 3-level form is only used in a slower but lower-memory-use
configuration). These tables, at least in the 2-level form, are
utterly *massive*.
I'm not sure if it creates them PROT_NONE and then only instantiates
real memory for the (initially fairly sparse) parts that get used, or
if it just allocates these giant things relying on overcommit, but
either way that's not compatible with small-memory-space systems or
with nommu.

On top of that, this approach relies on laying out whole pages (likely
large slabs of many pages at a time) of identical-sized objects so
that the size and other properties can be looked up by page number. I
have not looked into the details of "how bad" it gets, but it
completely precludes having any small processes, and precludes
promptly returning freed memory to the system, since *changing* the
pagemap is going to be costly and they're going to avoid doing it
(note the above comment on locking).

mallocng does not have any global mapping optimizing translation from
addresses to groups/metadata objects. Because we insist on global
consistency (a prerequisite for being able to define strong hardening
properties) and on being able to return freed memory promptly to the
system, maintaining such a data structure would cost a lot more time
(performance) than anything it could give, and it would make lock-free
operations (like your malloc_usable_size, or trivial realloc calls)
potentially require locking.

Instead of using the numeric value of the address to map to metadata,
we chase offsets from the object base address to the metadata, then
validate that it round-trips back to conclude that we didn't just
follow random junk from the caller passing an invalid/dangling pointer
in, or from this data being overwritten via heap-based buffer
overflows.

Fundamentally, this pointer chasing is going to be a little bit more
expensive than just using address bits as table indices, but not
really all that much. At least half of the cost difference, and
probably a lot more, is not the pointer/offset chasing but the
validation (hardening). If hypothetically you wanted to turn that all
off (e.g.
by defining the assert macro to a no-op) you could have it
be a lot faster, and still have low memory usage too. I'm not sure,
but for single-threaded loads I would not be surprised if it were
getting close to tcmalloc speed. Just casually building with assert()
defined to nop out the tests, I got double performance on your TEST2.
Of course, I don't recommend doing this. But it's an interesting test
for performing *measurement* (which you so far refuse to do) of what's
actually making the performance differences.

> On the other hand, if the low speed is not caused by having to
> return 6700, then we should be able to use a similar quick lookup
> table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve
> at least dozens of times performance improvement.

Once again, the big difference is not the "6700". The
tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines
6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of
the work here is in lines 6-11, mainly lines 7 and 10. I did a similar
casual test removing line 12 and just returning something based on the
earlier computations, and it made something like a 30% reduction in
test run time (with or without the hardening asserts nopped out).

Rich