Hi,

it seems that the following message, which came before the two patches, didn't make it to the list, probably because of an eps graphics attachment. So I am trying again without the graphics, and I'll try to send the graphic as a pdf in a separate mail.

My apologies
Jens

+++++++++++++++++++++++

Hi everybody,

with my current testing of the implementation I came to test and modify some parts of the lock and wait primitives in musl. Here is one stress test that I think shows interesting performance differences. At the bottom it uses only lock-free atomics, so the result itself is independent of the lock implementation.

The program is an implementation of a list structure that is shared between threads and used as a LIFO (stack). Threads draw a random number and then randomly decide to insert (and malloc) or remove (and free) that many members of the list. The probabilities are such that the list is not expected to grow too large. All of this is tightly interleaved between the threads, so a lot of allocations end up being freed by another thread, which puts the malloc subsystem under quite some stress (a rough sketch of such a driver is appended at the end of this mail).

The question I asked myself is how much spinning we need in such lock/wait primitives. In the first attached graphic, the three bottom curves show the difference that three spinning strategies make for this test on a 2x2 hyperthreaded core machine. The bottom one is no spinning at all, the next is the current strategy implemented in musl, and the third is a modification of that. As you can see, under a high load of threads they can make a substantial difference, but you can also see that musl's actual strategy is not very different from doing no spinning at all.

The "new" strategy simply avoids taking the shortcut that the current code takes while spinning: at the moment, when we detect that there are already waiters, we stop spinning and try to go into futex_wait immediately (see the schematic wait loop appended below). While this sounds nice at first glance, the figures seem to indicate that it is not a good idea and that we would be better off without it. I'll send a patch for that in a next mail, and also one that lets you modify the number of spins easily.

The only situation I could think of where this might be important is monoprocessors, where spinning might actually not be so good and aborting it early could be necessary. So I ran the same test with taskset to nail the process to just one core. The results of that are the top two curves. As you can see, here the spinning strategy has almost no influence, so I think we are safe to apply this patch.

Now all of this can also be read as a performance test of the malloc subsystem, and here my feeling goes in the direction that Rich recently indicated: the performance of the "application" is much better if I eliminate all parallelism. As an additional indication, there are two additional curves that fix the process to one core and its hyperthreaded sibling. So maybe we would be better off with a simpler malloc strategy that serializes all requests.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536 ::
:: :::::::::::::::::::::: gsm France : +33 651400183 ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
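
P.S. For readers who want to reproduce this kind of load, here is a minimal sketch of a cross-thread malloc/free stress driver. It is not the actual test program: to keep it short and to sidestep the memory-reclamation subtleties of a lock-free LIFO that frees its nodes, the shared structure is a fixed array of atomic exchange slots rather than a stack, and all names and parameters (SLOTS, NTHREADS, ROUNDS) are made up. The essential property is the same: the only shared operation on the structure is a lock-free atomic, a block allocated by one thread is usually freed by another, and so any lock contention comes from malloc/free alone.

#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS    1024      /* bounds the amount of live memory          */
#define NTHREADS 8         /* number of stressing threads               */
#define ROUNDS   1000000   /* malloc/free operations per thread         */

static _Atomic(void *) slot[SLOTS];

static void *stress(void *arg)
{
    unsigned seed = (unsigned)(uintptr_t)arg;
    for (long i = 0; i < ROUNDS; ++i) {
        unsigned r = rand_r(&seed);
        /* Either deposit a freshly malloc'd block into a random slot,
           or withdraw whatever another thread left there; the exchange
           is the only shared operation and it is a lock-free atomic. */
        void *new = (r & 1) ? malloc(16 + r % 512) : 0;
        void *old = atomic_exchange(&slot[r % SLOTS], new);
        free(old);         /* often a block malloc'd by another thread */
    }
    return 0;
}

int main(void)
{
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i)
        pthread_create(&th[i], 0, stress, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(th[i], 0);
    for (int i = 0; i < SLOTS; ++i)
        free(slot[i]);
    return 0;
}

Compile with "cc -O2 -pthread stress.c" and time it; prefixing the run with "taskset -c 0" (or with whatever CPU numbers correspond to one core and its hyperthreaded sibling on your machine) pins it in the way described above.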
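
P.P.S. To make the spinning discussion concrete, here is a schematic rendition of the kind of wait loop in question. This is not the musl source: the names (wait_on, SPINS) and the exact structure are mine, and error handling for the futex syscall is omitted. The only point is where the spin loop gives up.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPINS 100   /* spin budget before sleeping; made-up value */

static void futex_wait(atomic_int *addr, int val)
{
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, (void *)0, (void *)0, 0);
}

/* Block until *addr no longer holds val; *waiters counts sleepers. */
void wait_on(atomic_int *addr, atomic_int *waiters, int val)
{
    /* The "shortcut" under discussion is the second half of this loop
       condition: as soon as somebody is already sleeping, we give up
       spinning.  The proposed change is to drop that check and always
       use the full spin budget. */
    for (int i = 0; i < SPINS && !atomic_load(waiters); ++i)
        if (atomic_load(addr) != val)
            return;                       /* resolved while spinning */

    atomic_fetch_add(waiters, 1);
    while (atomic_load(addr) == val)
        futex_wait(addr, val);            /* sleep in the kernel */
    atomic_fetch_sub(waiters, 1);
}

Whether bailing out of the spin early helps depends on whether the other threads can actually make progress in the meantime: on a single core, spinning only burns the lock holder's time slice, which is why the taskset measurement above was the interesting sanity check.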