mailing list of musl libc
* spin strategy in __wait
@ 2015-08-29 19:14 Jens Gustedt
  2015-08-29 19:16 ` Jens Gustedt
  2015-08-30  0:57 ` Rich Felker
  0 siblings, 2 replies; 6+ messages in thread
From: Jens Gustedt @ 2015-08-29 19:14 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]

Hi,
it seems that the following message that came before the two patches
didn't make it to the list, probably because of an eps graphics
attachment. So I am trying it again without the graphics, and I'll
try to send the graphic as a PDF in a separate mail.
My apologies
Jens

+++++++++++++++++++++++

Hi everybody,

with my current testing of the <stdatomic.h> implementation I came to
test and modify some parts of the lock and wait primitives in
musl. Here is one stress test that I think shows interesting
performance differences. At the bottom it uses only lock-free
atomics, so the result itself is independent of <stdatomic.h>.

The program is an implementation of a list structure that is shared
between threads and used as a LIFO (stack). Threads draw a random
number and then randomly decide whether to insert (and malloc) or
remove (and free) that many members from the list. The probabilities
are such that the list does not grow too large in expectation.

All of this is tightly interleaved between threads, so it results in
a lot of allocations that are then usually freed by another thread,
and it puts the malloc system under quite a lot of stress.
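
For illustration, here is a minimal sketch of this kind of test (a
shared LIFO protected by an ordinary pthread mutex, with made-up names
and parameters). It is not the actual benchmark, just the shape of the
workload:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define ITERS    1000000

struct node { struct node *next; char payload[16]; };

static struct node *top;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void push_one(void)
{
	struct node *n = malloc(sizeof *n);
	if (!n) return;
	pthread_mutex_lock(&lock);
	n->next = top;
	top = n;
	pthread_mutex_unlock(&lock);
}

static void pop_one(void)
{
	pthread_mutex_lock(&lock);
	struct node *n = top;
	if (n) top = n->next;
	pthread_mutex_unlock(&lock);
	/* the node was often allocated by a different thread */
	free(n);                /* free(NULL) is a no-op */
}

static void *worker(void *arg)
{
	unsigned seed = (unsigned)(uintptr_t)arg + 1;
	for (long i = 0; i < ITERS; i++) {
		/* draw a small random count; a slight bias towards removal
		 * keeps the list from growing too large in expectation */
		int count = rand_r(&seed) % 4 + 1;
		if (rand_r(&seed) % 100 < 48)
			while (count--) push_one();
		else
			while (count--) pop_one();
	}
	return 0;
}

int main(void)
{
	pthread_t t[NTHREADS];
	for (uintptr_t i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], 0, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], 0);
	puts("done");
	return 0;
}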

The question that I asked myself is how much spinning do we need in
such lock/wait primitives. In the first attached graphic, the three
bottom curves show what difference three spinning strategies can make
for this test on a 2x2 hyperthreaded core machine. The bottom is no
spinning at all, the next is the current strategy implemented in musl,
and the third is a modification of that. As you can see, for a high
load of threads they can make a substantial difference, but you also
see that musl's actual strategy is not very different from doing no
spinning at all.

The "new" strategy is simply to avoid to take the shortcut that the
actual code takes when spinning. Here, currently when we detect that
are already waiters, we stop spinning and try to go into futex_wait
immediately. Where this sounds nice at a first glance, the figures
seem to indicate that this is not a good idea and that we would be
better off without. I'll send a patch for that in a next mail, and
also one that lets you modify the number of spins easily.
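
For readers without the source at hand, here is a heavily simplified
sketch of the shape of the logic in question. It paraphrases the
behavior described above and is neither the actual musl code nor the
patch: the waiter spins a bounded number of times on the futex word,
and the current strategy additionally gives up spinning as soon as the
waiter count is nonzero, while the "no-shortcut" variant simply drops
that extra test.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(volatile int *addr, int val)
{
	syscall(SYS_futex, addr, FUTEX_WAIT, val, (void *)0);
}

/* Wait until *addr != val.  "waiters" counts threads sleeping in
 * futex_wait (it may be a null pointer when not tracked). */
static void wait_on(volatile int *addr, volatile int *waiters, int val)
{
	int spins = 100;

	/* Current strategy: stop spinning as soon as somebody is already
	 * waiting (the "shortcut").  The "no-shortcut" variant drops the
	 * (!waiters || !*waiters) test and always uses the full budget. */
	while (spins-- && (!waiters || !*waiters)) {
		if (*addr != val) return;   /* value changed, no need to sleep */
		/* an architecture-specific spin hint (like a_spin) would go here */
	}

	if (waiters) __sync_fetch_and_add(waiters, 1);
	while (*addr == val)
		futex_wait(addr, val);
	if (waiters) __sync_fetch_and_sub(waiters, 1);
}

In terms of this sketch, the change under discussion amounts to
dropping the waiter test from the spin condition.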

Now the only situation I thought of where this could be important is
monoprocessors, where actually spinning might not be so good and
aborting it early could be necessary. So I ran the same test with
taskset to nail the process to just one core. The results of that are
the top two curves. As you can see, here the spinning strategy has
almost no influence, so I think we are safe to apply this patch.
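
(For reference, such pinning is typically done with something like
"taskset -c 0 ./test"; the CPU numbers to use for a single core, or
for a core plus its hyperthreaded sibling, depend on the machine's
topology, and the command here is just an illustration.)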

Now all of this can also be read as a performance test of the malloc
subsystem, and here my feeling goes in the direction that Rich
recently indicated. The performance of the "application" is much
better if I eliminate all parallelism. As an additional indication
there are two additional curves that fix the process to one core and
its hyperthreaded sibling.

So maybe we might be better off with a simpler malloc strategy that
serializes all requests.

Jens



-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: spin strategy in __wait
  2015-08-29 19:14 spin strategy in __wait Jens Gustedt
@ 2015-08-29 19:16 ` Jens Gustedt
  2015-08-30  0:57 ` Rich Felker
  1 sibling, 0 replies; 6+ messages in thread
From: Jens Gustedt @ 2015-08-29 19:16 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 554 bytes --]

On Saturday, 29.08.2015, 21:14 +0200, Jens Gustedt wrote:
> So I am trying it again without the graphics, and I'll
> try to send the graphic as a PDF in a separate mail.

So here is the attachment, hopefully it will make it this time.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #1.2: test-spin-shortcut-all.pdf --]
[-- Type: application/pdf, Size: 7980 bytes --]


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: spin strategy in __wait
  2015-08-29 19:14 spin strategy in __wait Jens Gustedt
  2015-08-29 19:16 ` Jens Gustedt
@ 2015-08-30  0:57 ` Rich Felker
  2015-08-30 15:31   ` Jens Gustedt
  1 sibling, 1 reply; 6+ messages in thread
From: Rich Felker @ 2015-08-30  0:57 UTC (permalink / raw)
  To: musl

On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> The question that I asked myself is how much spinning do we need in
> such lock/wait primitives. In the first attached graphic, the three
> bottom curves show what difference three spinning strategies can make
> for this test on a 2x2 hyperthreaded core machine. The bottom is no
> spinning at all, the next is the current strategy implemented in musl,
> and the third is a modification of that. As you can see, for a high
> load of threads they can make a substantial difference, but you also
> see that musl's actual strategy is not very different from doing no
> spinning at all.

That's intentional when you have waiters. Notice in your graph that
for the range 2-4 threads, performance in musl is comparable to your
"no-shortcut" approach. This shows that the spinning is working as
intended.

BTW I think the names you chose are misleading. :-) Refraining from
spinning is not a "shortcut" musl is taking. It's not trying to be
quicker, but to refrain from aggressive behavior which would be
expected to lead to pathological lock unfairness. I would call the
approaches aggressive-spin, opportunistic-spin, and no-spin.

> The "new" strategy is simply to avoid to take the shortcut that the
> actual code takes when spinning. Here, currently when we detect that
> are already waiters, we stop spinning and try to go into futex_wait
> immediately. Where this sounds nice at a first glance, the figures
> seem to indicate that this is not a good idea and that we would be
> better off without. I'll send a patch for that in a next mail, and
> also one that lets you modify the number of spins easily.

I don't think you can say we'd be better off in general with a single
test. In areas like this you need to be able to demonstrate that there
are no cases where the expected pathologically bad behavior can
happen. I suspect you get lucky with malloc because there's a fair
amount of code being executed between successive attempts on the lock
rather than intense hammering.

On the other hand, are you even counting the number of times each
individual thread performs a malloc/free cycle, or only the total
across all threads? Is it possible that, among your 256 threads, the
vast majority of malloc/free cycles are being performed by a small
number and the rest are stuck repeating futex_wait over and over?

> Now the only situation I thought of where this could be important is
> monoprocessors, where actually spinning might not be so good and
> aborting it early could be necessary. So I ran the same test with
> taskset to nail the process to just one core. The results of that are
> the top two curves. As you can see, here the spinning strategy has
> almost no influence, so I think we are safe to apply this patch.

This is very surprising to me, which further makes me doubt the test
methodology. The current number of spins is tuned to be comparable in
magnitude to the syscall cost, at least on x86. Adding this extra cost
to each contended lock operation (you basically never stop spinning
early on single-core) should produce a measurable difference.

> Now all of this can also be read as a performance test of the malloc
> subsystem, and here my feeling goes in the direction that Rich
> recently indicated. The performance of the "application" is much
> better if I eliminate all parallelism. As an additional indication
> there are two additional curves that fix the process to one core and
> its hyperthreaded sibling.
> 
> So maybe we might be better off with a simpler malloc strategy that
> serializes all requests.

I tried this a few weeks ago and it performed significantly worse on
my malloc stress tests. And it will get even worse as we try to scale
to larger numbers of cores, or worse yet NUMA...

Rich


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: spin strategy in __wait
  2015-08-30  0:57 ` Rich Felker
@ 2015-08-30 15:31   ` Jens Gustedt
  2015-08-30 17:18     ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Gustedt @ 2015-08-30 15:31 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 6507 bytes --]

On Saturday, 29.08.2015, 20:57 -0400, Rich Felker wrote:
> On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > The question that I asked myself is how much spinning do we need in
> > such lock/wait primitives. In the first attached graphic, the three
> > bottom curves show what difference three spinning strategies can make
> > for this test on a 2x2 hyperthreaded core machine. The bottom is no
> > spinning at all, the next is the current strategy implemented in musl,
> > and the third is a modification of that. As you can see, for a high
> > load of threads they can make a substantial difference, but you also
> > see that musl's actual strategy is not very different from doing no
> > spinning at all.
> 
> That's intentional when you have waiters. Notice in your graph that
> for the range 2-4 threads, performance in musl is comparable to your
> "no-shortcut" approach. This shows that the spinning is working as
> intended.

In that range, yes. When the machine is overcommitted with, say, 2 times
more threads than the number of cores, it is significantly off.

> BTW I think the names you chose are misleading. :-) Refraining from
> spinning is not a "shortcut" musl is taking. It's not trying to be
> quicker, but to refrain from aggressive behavior which would be
> expected to lead to pathological lock unfairness. I would call the
> approaches aggressive-spin, opportunistic-spin, and no-spin.

Probably, they were really just "working" names for my scripts.

> > The "new" strategy is simply to avoid to take the shortcut that the
> > actual code takes when spinning. Here, currently when we detect that
> > are already waiters, we stop spinning and try to go into futex_wait
> > immediately. Where this sounds nice at a first glance, the figures
> > seem to indicate that this is not a good idea and that we would be
> > better off without. I'll send a patch for that in a next mail, and
> > also one that lets you modify the number of spins easily.
> 
> I don't think you can say we'd be better off in general with a single
> test.

Sure, but it should be taken as a data point. Average behavior of the
malloc implementation seems to be important to me.

> In areas like this you need to be able to demonstrate that there
> are no cases where the expected pathologically bad behavior can
> happen. I suspect you get lucky with malloc because there's a fair
> amount of code being executed between successive attempts on the lock
> rather than intense hammering.

Perhaps. But the behavior that I observe is not that much different
where I used these things as locks for the atomic stuff. There the
work that is done inside the critical section is just a memcpy/memcmp
of 16 bytes.

> On the other hand, are you even counting the number of times each
> individual thread performs a malloc/free cycle, or only the total
> across all threads? Is it possible that, among your 256 threads, the
> vast majority of malloc/free cycles are being performed by a small
> number and the rest are stuck repeating futex_wait over and over?

I now also collected these statistics, and, no, all threads get their
share. I attach two data files (tab-separated values) that show
this. These files have one line per run (10 sec each), with the
following columns

number of threads
average number per thread of calls to malloc/free during the 10 secs
standard deviation of that quantity
the min value over all threads
the max value over all threads

As you may see, the variation is the biggest around 3-4 threads, but
stays reasonable, I think: from overall min to overall max there is a
difference of 50% or so. Nobody is blocked.

This variation then gets really small for up to 64 threads. For 128
and 256 it is a bit bigger, but overall the numbers are a bit too
small for those cases and one should perhaps run the test for more than
10 secs.
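
For completeness, a minimal sketch (made-up names, not the actual test
harness) of how per-thread cycle counters can be reduced to the columns
listed above:

#include <math.h>
#include <stdio.h>

#define MAXTHREADS 256

/* cycles[i] is incremented by thread i for every malloc/free cycle it
 * completes; only the owning thread writes its slot during the run */
static unsigned long long cycles[MAXTHREADS];

static void report(unsigned nthreads)
{
	double sum = 0, sumsq = 0;
	unsigned long long min = cycles[0], max = cycles[0];

	for (unsigned i = 0; i < nthreads; i++) {
		double x = (double)cycles[i];
		sum += x;
		sumsq += x * x;
		if (cycles[i] < min) min = cycles[i];
		if (cycles[i] > max) max = cycles[i];
	}
	double mean = sum / nthreads;
	/* sample standard deviation; this is 0/0 = nan for a single
	 * thread, matching the nan entries in the attached files */
	double sdev = sqrt((sumsq - nthreads * mean * mean) / (nthreads - 1));
	printf("\t%u\t%g\t%g\t%llu\t%llu\n", nthreads, mean, sdev, min, max);
}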

> > Now the only situation I thought of where this could be important is
> > monoprocessors, where actually spinning might not be so good and
> > aborting it early could be necessary. So I ran the same test with
> > taskset to nail the process to just one core. The results of that are
> > the top two curves. As you can see, here the spinning strategy has
> > almost no influence, so I think we are safe to apply this patch.
> 
> This is very surprising to me, which further makes me doubt the test
> methodology. The current number of spins is tuned to be comparable in
> magnitude to the syscall cost, at least on x86.

On my machine, x86_64, with a quite recent kernel it is not; it is a
factor of 10 off. As I said, I observe a factor of 10 between a spin
and a syscall, and not 100 as the existing code suggests.

And, BTW, I also observed that the call to "a_spin" doesn't have much
effect on my machine, neither positive nor negative.
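
As an aside, a spin-versus-syscall ratio of this kind can be estimated
with a crude microbenchmark along the following lines. This is
illustrative only: the loop count, the choice of SYS_getpid as a cheap
syscall, and the x86 "pause" standing in for a_spin are assumptions,
not the measurement that was actually used.

#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double elapsed(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) + 1e-9 * (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
	enum { N = 1000000 };
	struct timespec t0, t1, t2;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < N; i++)
		__asm__ __volatile__ ("pause");  /* x86 spin hint, roughly what a_spin does */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (long i = 0; i < N; i++)
		syscall(SYS_getpid);             /* a cheap syscall, entered for real */
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("pause:   %8.1f ns/iter\n", 1e9 * elapsed(&t0, &t1) / N);
	printf("syscall: %8.1f ns/iter\n", 1e9 * elapsed(&t1, &t2) / N);
	return 0;
}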

> Adding this extra cost
> to each contended lock operation (you basically never stop spinning
> early on single-core) should produce a measurable difference.

No, on single core you have only one thread active at any time, and
the probability that it is descheduled in the middle when holding the
lock is tiny. So basically nobody is ever spinning as long as the time
that is spent in the critical section is small. (And you may see
noticeable outliers once a thread is descheduled in the middle.)

> > Now all of this can also be read as a performance test of the malloc
> > subsystem, and here my feeling goes in the direction that Rich
> > recently indicated. The performance of the "application" is much
> > better if I eliminate all parallelism. As an additional indication
> > there are two additional curves that fix the process to one core and
> > its hyperthreaded sibling.
> > 
> > So maybe we might be better off with a simpler malloc strategy that
> > serializes all requests.
> 
> I tried this a few weeks ago and it performed significantly worse on
> my malloc stress tests. And it will get even worse as we try to scale
> to larger numbers of cores, or worse yet NUMA...

Did you try running your malloc stress test with taskset to nail the
process to one core?

Jens


-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #1.2: test-16b-spin0-per-thread.csv --]
[-- Type: text/csv, Size: 2623 bytes --]

	1	1.01016e+08	-nan	1.01016e+08	1.01016e+08
	1	1.04732e+08	-nan	1.04732e+08	1.04732e+08
	1	1.04914e+08	-nan	1.04914e+08	1.04914e+08
	1	1.04952e+08	-nan	1.04952e+08	1.04952e+08
	1	1.05086e+08	-nan	1.05086e+08	1.05086e+08
	2	2.46039e+07	2.18992e+06	2.30554e+07	2.61524e+07
	2	2.50963e+07	2.39869e+06	2.34001e+07	2.67924e+07
	2	2.51581e+07	118988	2.50739e+07	2.52422e+07
	2	2.52482e+07	184124	2.5118e+07	2.53784e+07
	2	2.53878e+07	77199.8	2.53332e+07	2.54424e+07
	3	1.67941e+07	1.22777e+06	1.55103e+07	1.79569e+07
	3	1.69299e+07	2.88852e+06	1.37821e+07	1.94588e+07
	3	1.69746e+07	1.11383e+06	1.58615e+07	1.80891e+07
	3	1.69803e+07	2.72596e+06	1.38945e+07	1.90608e+07
	3	1.70611e+07	2.96959e+06	1.43008e+07	2.02031e+07
	4	1.06278e+07	54970.5	1.05784e+07	1.06976e+07
	4	1.06599e+07	14874.9	1.06389e+07	1.06733e+07
	4	1.06778e+07	75440.3	1.05898e+07	1.07472e+07
	4	1.07032e+07	309841	1.03648e+07	1.09812e+07
	4	1.07155e+07	52456.8	1.06669e+07	1.0786e+07
	6	6.23738e+06	52035	6.17275e+06	6.31511e+06
	6	6.9085e+06	57309.9	6.833e+06	6.98744e+06
	6	6.91585e+06	57435.3	6.80421e+06	6.95838e+06
	6	6.92278e+06	41643.7	6.85664e+06	6.98626e+06
	6	6.95214e+06	48384.9	6.90512e+06	7.01585e+06
	8	5.12103e+06	18367.5	5.09235e+06	5.13965e+06
	8	5.16199e+06	23157.4	5.13432e+06	5.20321e+06
	8	5.16509e+06	42808	5.08073e+06	5.21662e+06
	8	5.16921e+06	22753.1	5.11841e+06	5.18908e+06
	8	5.17771e+06	29190.6	5.1326e+06	5.21164e+06
	12	3.37277e+06	13336	3.35545e+06	3.39847e+06
	12	3.37308e+06	17031.5	3.34394e+06	3.40918e+06
	12	3.38457e+06	17371	3.34034e+06	3.40827e+06
	12	3.38833e+06	11821.9	3.37193e+06	3.40708e+06
	12	3.3892e+06	15952.4	3.36335e+06	3.42432e+06
	16	2.49975e+06	8617.97	2.48509e+06	2.51503e+06
	16	2.50347e+06	10522.4	2.48523e+06	2.51911e+06
	16	2.50555e+06	11512	2.4835e+06	2.53092e+06
	16	2.51588e+06	10504.7	2.49727e+06	2.53253e+06
	16	2.51896e+06	13934.7	2.48171e+06	2.54308e+06
	32	1.21254e+06	7675.72	1.19799e+06	1.22922e+06
	32	1.21318e+06	7300.1	1.19331e+06	1.22702e+06
	32	1.21326e+06	9136.62	1.19216e+06	1.2298e+06
	32	1.21697e+06	9822.09	1.20194e+06	1.24216e+06
	32	1.22125e+06	11461.6	1.20222e+06	1.24949e+06
	64	502261	4459.83	489331	512667
	64	587878	5187.48	569105	601050
	64	588159	5264.49	578599	602014
	64	588483	7979.81	572640	609086
	64	589517	6514.76	577731	609222
	128	250213	4264.7	242022	264559
	128	287071	7211.82	272031	311657
	128	288293	4510.29	278077	302281
	128	288667	5755.68	276221	306737
	128	288736	6430.4	272257	301830
	256	117836	3759.99	109709	130528
	256	118531	5745.32	104642	138991
	256	136805	6744.08	119832	156503
	256	137325	5070.56	124335	150717
	256	139372	4256.53	127418	152075

[-- Attachment #1.3: test-16b-no-shortcut-per-thread.csv --]
[-- Type: text/csv, Size: 2595 bytes --]

	1	1.02845e+08	-nan	1.02845e+08	1.02845e+08
	1	1.04206e+08	-nan	1.04206e+08	1.04206e+08
	1	1.04621e+08	-nan	1.04621e+08	1.04621e+08
	1	1.04701e+08	-nan	1.04701e+08	1.04701e+08
	1	1.04739e+08	-nan	1.04739e+08	1.04739e+08
	2	2.64241e+07	509873	2.60636e+07	2.67846e+07
	2	2.64415e+07	1.77524e+06	2.51862e+07	2.76968e+07
	2	2.65167e+07	10340	2.65094e+07	2.6524e+07
	2	2.66158e+07	578717	2.62066e+07	2.7025e+07
	2	2.67702e+07	118935	2.66861e+07	2.68543e+07
	3	1.82369e+07	1.50911e+06	1.7102e+07	1.99496e+07
	3	1.834e+07	2.10534e+06	1.65505e+07	2.06598e+07
	3	1.83618e+07	4.53535e+06	1.57391e+07	2.35987e+07
	3	1.85352e+07	1.00992e+06	1.74676e+07	1.94754e+07
	3	1.91305e+07	3.02333e+06	1.71631e+07	2.26117e+07
	4	1.23284e+07	46307	1.22609e+07	1.23644e+07
	4	1.23401e+07	46034.7	1.22807e+07	1.23903e+07
	4	1.24542e+07	206137	1.21485e+07	1.26e+07
	4	1.25545e+07	1.26681e+06	1.09299e+07	1.40236e+07
	4	1.29099e+07	49191.2	1.28699e+07	1.2972e+07
	6	8.23012e+06	280831	7.89525e+06	8.63334e+06
	6	8.2348e+06	483166	7.52975e+06	8.71746e+06
	6	8.26691e+06	354248	7.85887e+06	8.8674e+06
	6	8.3743e+06	558499	7.79402e+06	9.30234e+06
	6	8.73745e+06	131690	8.53941e+06	8.88783e+06
	8	6.17623e+06	69216.1	6.12149e+06	6.337e+06
	8	6.2033e+06	183890	5.9529e+06	6.52034e+06
	8	6.2055e+06	103515	6.03603e+06	6.35771e+06
	8	6.25969e+06	152803	6.04837e+06	6.4844e+06
	8	6.26351e+06	72339.2	6.16112e+06	6.38241e+06
	12	4.11779e+06	98062.7	3.94202e+06	4.24957e+06
	12	4.11928e+06	64768.5	4.00667e+06	4.2339e+06
	12	4.14352e+06	65757.5	4.03701e+06	4.24415e+06
	12	4.16054e+06	65751.6	4.04917e+06	4.27825e+06
	12	4.17707e+06	70459.7	4.02992e+06	4.30936e+06
	16	3.09386e+06	53225.5	2.98488e+06	3.18386e+06
	16	3.09917e+06	54215.3	2.99338e+06	3.20062e+06
	16	3.10259e+06	49156.8	3.0009e+06	3.2256e+06
	16	3.10438e+06	48870.9	2.97897e+06	3.18441e+06
	16	3.10848e+06	50904.3	2.94679e+06	3.17448e+06
	32	1.55153e+06	37393	1.4648e+06	1.6139e+06
	32	1.55263e+06	30617.3	1.48344e+06	1.61041e+06
	32	1.55593e+06	30061.1	1.48991e+06	1.61142e+06
	32	1.56356e+06	36687	1.51453e+06	1.64781e+06
	32	1.56359e+06	36611.3	1.48372e+06	1.65024e+06
	64	767720	23753.9	708521	827938
	64	771237	29540.5	701362	863089
	64	772007	30036.8	717081	849769
	64	775743	27127.9	708516	839689
	64	776068	23019	737473	840368
	128	380824	21555	328758	435084
	128	384902	21596.1	312762	424542
	128	386031	24501.2	328767	481418
	128	386886	21046	338578	443888
	128	387581	23771	345513	466983
	256	190511	15216.7	153045	226921
	256	190965	15672.1	140328	244286
	256	191556	16978	136718	237745
	256	192105	16504.2	155514	239763
	256	192403	15898.5	147119	239026


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: spin strategy in __wait
  2015-08-30 15:31   ` Jens Gustedt
@ 2015-08-30 17:18     ` Rich Felker
  2015-08-30 21:38       ` Jens Gustedt
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2015-08-30 17:18 UTC (permalink / raw)
  To: musl

On Sun, Aug 30, 2015 at 05:31:39PM +0200, Jens Gustedt wrote:
> On Saturday, 29.08.2015, 20:57 -0400, Rich Felker wrote:
> > On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > > The question that I asked myself is how much spinning do we need in
> > > such lock/wait primitives. In the first attached graphic, the three
> > > bottom curves show what difference three spinning strategies can make
> > > for this test on a 2x2 hyperthreaded core machine. The bottom is no
> > > spinning at all, the next is the current strategy implemented in musl,
> > > and the third is a modification of that. As you can see, for a high
> > > load of threads they can make a substantial difference, but you also
> > > see that musl's actual strategy is not very different from doing no
> > > spinning at all.
> > 
> > That's intentional when you have waiters. Notice in your graph that
> > for the range 2-4 threads, performance in musl is comparable to your
> > "no-shortcut" approach. This shows that the spinning is working as
> > intended.
> 
> In that range, yes. When the machine is overcommitted with, say, 2 times
> more threads than the number of cores, it is significantly off.

I said "as intended". The intent is that no spinning happen when there
are more threads than cores. You're free to claim this is suboptimal
for raw performance, but it's the intended behavior.

> > On the other hand, are you even counting the number of times each
> > individual thread performs a malloc/free cycle, or only the total
> > across all threads? Is it possible that, among your 256 threads, the
> > vast majority of malloc/free cycles are being performed by a small
> > number and the rest are stuck repeating futex_wait over and over?
> 
> I now also collected these statistics, and, no, all threads get their
> share. I attach two data files (tab-separated values) that show
> this. These files have one line per run (10 sec each), with the
> following columns
> 
> number of threads
> average number per thread of calls to malloc/free during the 10 secs
> standard deviation of that quantity
> the min value over all threads
> the max value over all threads
> 
> As you may see, the variation is the biggest around 3-4 threads, but
> stays reasonable, I think: from overall min to overall max there is a
> difference of 50% or so. Nobody is blocked.
> 
> This variation then gets really small for up to 64 threads. For 128
> and 256 it is a bit bigger, but overall the numbers are a bit too
> small for those cases and one should perhaps run the test for more than
> 10 secs.

Interesting.

> > > Now the only situation I thought of where this could be important is
> > > monoprocessors, where actually spinning might not be so good and
> > > aborting it early could be necessary. So I ran the same test with
> > > taskset to nail the process to just one core. The results of that are
> > > the top two curves. As you can see, here the spinning strategy has
> > > almost no influence, so I think we are safe to apply this patch.
> > 
> > This is very surprising to me, which further makes me doubt the test
> > methodology. The current number of spins is tuned to be comparable in
> > magnitude to the syscall cost, at least on x86.
> 
> On my machine, x86_64, with a quite recent kernel it is not; it is a
> factor of 10 off. As I said, I observe a factor of 10 between a spin
> and a syscall, and not 100 as the existing code suggests.
> 
> And, BTW, I also observed that the call to "a_spin" doesn't have much
> effect on my machine, neither positive nor negative.
> 
> > Adding this extra cost
> > to each contended lock operation (you basically never stop spinning
> > early on single-core) should produce a measurable difference.
> 
> No, on single core you have only one thread active at any time, and
> the probability that it is descheduled in the middle when holding the
> lock is tiny. So basically nobody is ever spinning as long as the time
> that is spent in the critical section is small. (And you may see
> noticeable outliers once a thread is descheduled in the middle.)

Consider what happens when the thread holding the lock is preempted.

If there is no spinning, the futex_wait syscall should happen
immediately and cause the lock-holder to be scheduled again.

If there is spinning, you have a 10x syscall-time (by your
measurement) additional delay before the lock-holder gets scheduled
again.

I'm failing to see why this does not have a significant effect on
performance. Maybe the kernel just does a good job of optimizing for
not preempting a lock-holder. This optimization would naturally happen
if the kernel chooses to end a task's timeslice early when it makes a
syscall with a trivial amount of time remaining to run after the
syscall would return, which is a good optimization to make because
it's cheaper to switch tasks when you're already in kernelspace than
via a timer interrupt a few microseconds later.

> > > Now all of this can also be read as a performance test of the malloc
> > > subsystem, and here my feeling goes in the direction that Rich
> > > recently indicated. The performance of the "application" is much
> > > better if I eliminate all parallelism. As an additional indication
> > > there are two additional curves that fix the process to one core and
> > > its hyperthreaded sibling.
> > > 
> > > So maybe we might be better off with a simpler malloc strategy that
> > > serializes all requests.
> > 
> > I tried this a few weeks ago and it performed significantly worse on
> > my malloc stress tests. And it will get even worse as we try to scale
> > to larger numbers of cores, or worse yet NUMA...
> 
> Did you try running your malloc stress test with taskset to nail the
> process to one core?

No. I could do that, but I don't think there's any sense in optimizing
for multi-threaded load on single-core at the expense of multi-core
performance.

Rich


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: spin strategy in __wait
  2015-08-30 17:18     ` Rich Felker
@ 2015-08-30 21:38       ` Jens Gustedt
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Gustedt @ 2015-08-30 21:38 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 5469 bytes --]

On Sunday, 30.08.2015, 13:18 -0400, Rich Felker wrote:
> On Sun, Aug 30, 2015 at 05:31:39PM +0200, Jens Gustedt wrote:
> > On Saturday, 29.08.2015, 20:57 -0400, Rich Felker wrote:
> > > On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > > > Now the only situation I thought of where this could be important is
> > > > monoprocessors, where actually spinning might not be so good and
> > > > aborting it early could be necessary. So I ran the same test with
> > > > taskset to nail the process to just one core. The results of that are
> > > > the top two curves. As you can see, here the spinning strategy has
> > > > almost no influence, so I think we are safe to apply this patch.
> > > 
> > > This is very surprising to me, which further makes me doubt the test
> > > methodology. The current number of spins is tuned to be comparable in
> > > magnitude to the syscall cost, at least on x86.
> > 
> > On my machine, x86_64, with a quite recent kernel it is not; it is a
> > factor of 10 off. As I said, I observe a factor of 10 between a spin
> > and a syscall, and not 100 as the existing code suggests.
> > 
> > And, BTW, I also observed that the call to "a_spin" doesn't have much
> > effect on my machine, neither positive nor negative.
> > 
> > > Adding this extra cost
> > > to each contended lock operation (you basically never stop spinning
> > > early on single-core) should produce a measurable difference.
> > 
> > No, on single core you have only one thread active at any time, and
> > the probability that it is descheduled in the middle when holding the
> > lock is tiny. So basically nobody is ever spinning as long as the time
> > that is spent in the critical section is small. (And you may see
> > noticeable outliers once a thread is descheduled in the middle.)
> 
> Consider what happens when the thread holding the lock is preempted.
> 
> If there is no spinning, the futex_wait syscall should happen
> immediately and cause the lock-holder to be scheduled again.
> 
> If there is spinning, you have a 10x syscall-time (by your
> measurement) additional delay before the lock-holder gets scheduled
> again.
> 
> I'm failing to see why this does not have a significant effect on
> performance.

I think that for the average performance this has no effect because
the probability that such an event occurs is really small: the number
of exceptional cases is several orders of magnitude less than the
"normal" case. So you simply can't see it on average.

Second, the work that is done by spinning is still considerably less
in that benchmark than the "real" work that is done inside the
critical section(s).

So statistically, you just can't see it.

With the <stdatomic.h> code I did experiments that randomly preempted
threads in the middle of the critical section. The results show that
everything stays quite stable even with a substantial share of such
preemptions.
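
For what it's worth, one way such preemption injection could look (a
sketch, not the actual <stdatomic.h> test code): with a small
probability, give up the CPU while still holding the lock.

#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void critical_section(unsigned *seed)
{
	pthread_mutex_lock(&lock);
	/* ... the real work of the critical section goes here ... */
	if (rand_r(seed) % 1000 == 0)
		sched_yield();  /* simulate being descheduled while holding the lock */
	pthread_mutex_unlock(&lock);
}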

> Maybe the kernel just does a good job of optimizing for
> not preempting a lock-holder. This optimization would naturally happen
> if the kernel chooses to end a task's timeslice early when it makes a
> syscall with a trivial amount of time remaining to run after the
> syscall would return, which is a good optimization to make because
> it's cheaper to switch tasks when you're already in kernelspace than
> via a timer interrupt a few microseconds later.

interesting theory

The only thing that I can deduce so far is that these things are much
more stable than I would have expected. Linux is a marvelous tool.

> > > > Now all of this can also be read as a performance test of the malloc
> > > > subsystem, and here my feeling goes in the direction that Rich
> > > > recently indicated. The performance of the "application" is much
> > > > better if I eliminate all parallelism. As an additional indication
> > > > there are two additional curves that fix the process to one core and
> > > > its hyperthreaded sibling.
> > > > 
> > > > So maybe we might be better off with a simpler malloc strategy that
> > > > serializes all requests.
> > > 
> > > I tried this a few weeks ago and it performed significantly worse on
> > > my malloc stress tests. And it will get even worse as we try to scale
> > > to larger numbers of cores, or worse yet NUMA...
> > 
> > Did you try running your malloc stress test with taskset to nail the
> > process to one core?
> 
> No. I could do that, but I don't think there's any sense in optimizing
> for multi-threaded load on single-core at the expense of multi-core
> performance.

This was not my idea, and that is not what is going on. On the
single-core the performance *is* already much, much better than on
multi-core. So you did that optimization already :)

In a perfect world, a multi-threaded malloc stress test that is bound
to one core should drop performance, not increase it. If there is no
parallel gain with the malloc implementation one should just be better
off with a serialized malloc.

I'd just like to know what's going on. It would be good to have a
second data point.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-08-30 21:38 UTC | newest]

Thread overview: 6+ messages
2015-08-29 19:14 spin strategy in __wait Jens Gustedt
2015-08-29 19:16 ` Jens Gustedt
2015-08-30  0:57 ` Rich Felker
2015-08-30 15:31   ` Jens Gustedt
2015-08-30 17:18     ` Rich Felker
2015-08-30 21:38       ` Jens Gustedt

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
