From: Jens Gustedt <jens.gustedt@inria.fr>
To: musl@lists.openwall.com
Subject: Re: spin strategy in __wait
Date: Sun, 30 Aug 2015 17:31:39 +0200 [thread overview]
Message-ID: <1440948698.693.21.camel@inria.fr> (raw)
In-Reply-To: <20150830005745.GI7833@brightrain.aerifal.cx>
[-- Attachment #1.1: Type: text/plain, Size: 6507 bytes --]
Am Samstag, den 29.08.2015, 20:57 -0400 schrieb Rich Felker:
> On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > The question that I asked myself is how much spinning to we need in
> > such lock/wait primitives. In the first attached graphic, the three
> > bottom curves show what difference three spinning strategies can make
> > for this test on a 2x2 hyperthreaded core machine. The bottom is no
> > spinning at all, the next is the current strategy implemented in musl,
> > and the third is a modification of that. As you can see, for a high
> > load of threads they can make a substantial difference, but you also
> > see that musl's actually strategy is not very different from doing no
> > spinning at all.
>
> That's intentional when you have waiters. Notice in your graph that
> for the range 2-4 threads, performance in musl is comparable to your
> "no-shortcut" approach. This shows that the spinning is working as
> intended.
In that range, yes. When the machine is overcommitted with, say, 2 times
more threads than the number of cores, it is significantly off.
> BTW I think the names you chose are misleading. :-) Refraining to spin
> is not a "shortcut" musl is taking. It's not trying to be quicker, but
> to refrain from aggressive behavior which should lead to pathological
> lock unfairness. I would call the approaches aggressive-spin,
> opportunistic-spin, and no-spin.
Probably, they were really just "working" names for my scripts.
> > The "new" strategy is simply to avoid to take the shortcut that the
> > actual code takes when spinning. Here, currently when we detect that
> > are already waiters, we stop spinning and try to go into futex_wait
> > immediately. Where this sounds nice at a first glance, the figures
> > seem to indicate that this is not a good idea and that we would be
> > better off without. I'll send a patch for that in a next mail, and
> > also one that lets you modify the number of spins easily.
>
> I don't think you can say we'd be better off in general with a single
> test.
Sure, but it should be taken as a data point. Average behavior of the
malloc implementation seems to be important to me.
> In areas like this you need to be able to demonstrate that there
> are no cases where the expected pathologically bad behavior can
> happen. I suspect you get lucky with malloc because there's a fair
> amount of code being executed between successive attempts on the lock
> rather than intense hammering.
Perhaps. But the behavior that I observe is not so much different
where I used these things as locks for the atomic stuff. There the
work that is done inside the critical section is just a memcpy/memcmp
of 16 byte.
> On the other hand, are you even counting the number of times each
> individual thread performs a malloc/free cycle, or only the total
> across all threads? Is it possible that, among your 256 threads, the
> vast majority of malloc/free cycles are being performed by a small
> number and the rest are stuck repeating futex_wait over and over?
I now also collected these statistics, and, no, all threads get their
share. I attach to data files (tab separated values) that show
this. These files have one line per run (10 sec each), with the
following columns
number of threads
average number per thread of calls to malloc/free during the 10 secs
standard deviation of that quantity
the min value over all threads
the max value over all threads
As you may see, the variation is the biggest around 3-4 threads, but
stays reasonable, I think, from overall min to overall max there is a
difference of 50% or so. Nobody is blocked.
This variation then gets really small for up to 64 threads. For 128
and 256 it is a bit bigger, but overall the numbers are a bit too
small for that cases and one should perhaps run the test for more than
10 secs.
> > Now the only situation I thought of where this could be important is
> > monoprocessors where actually spinning might not be so good and
> > aborting it early good be necessary. So I rand the same test with
> > taskset to nail the process to just one core. The result of that are
> > the top two curves. As you can see, here the spinning strategy has
> > almost no influence so I think we are safe to apply this patch.
>
> This is very surprising to me, which further makes me doubt the test
> methodology. The current number of spins is tuned to be comparable in
> magnitude to the syscall cost, at least on x86.
On my machine, x86_64, with are quite recent kernel it is not, it is a
factor of 10 off. As I said, I observe a factor of 10 between a spin
and a syscall, and not 100 as the existing code suggests.
And, BTW, I also observed that the call to "a_spin" doesn't have much
effect on my machine, neither positive nor negative.
> Adding this extra cost
> to each contended lock operation (you basically never stop spinning
> early on single-core) should produce a measurable difference.
No, on single core you have only one thread active at any time, and
the probability that it is descheduled in the middle when holding the
lock is tiny. So basically nobody is ever spinning as long as the time
that is spend in the critical section is small. (And you may have
sensible outlayers, once a thread is descheduled in the middle.)
> > Now all of this can also be read as a performance test of the malloc
> > subsystem, and here my feeling goes in the direction that Rich
> > recently indicated. The performance of the "application" is much
> > better if I eliminate all parallelism. As an additional indication
> > there are to additional curves that fix the process to one core and
> > its hyperthreaded sibling.
> >
> > So maybe we might be better off with a simpler malloc strategy that
> > serializes all requests.
>
> I tried this a few weeks ago and it performed significantly worse on
> my malloc stress tests. And it will get even worse as we try to scale
> to larger numbers of cores, or worse yet NUMA...
Did you try running your malloc stress test with taskset to nail the
process to one core?
Jens
--
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536 ::
:: :::::::::::::::::::::: gsm France : +33 651400183 ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
[-- Attachment #1.2: test-16b-spin0-per-thread.csv --]
[-- Type: text/csv, Size: 2623 bytes --]
1 1.01016e+08 -nan 1.01016e+08 1.01016e+08
1 1.04732e+08 -nan 1.04732e+08 1.04732e+08
1 1.04914e+08 -nan 1.04914e+08 1.04914e+08
1 1.04952e+08 -nan 1.04952e+08 1.04952e+08
1 1.05086e+08 -nan 1.05086e+08 1.05086e+08
2 2.46039e+07 2.18992e+06 2.30554e+07 2.61524e+07
2 2.50963e+07 2.39869e+06 2.34001e+07 2.67924e+07
2 2.51581e+07 118988 2.50739e+07 2.52422e+07
2 2.52482e+07 184124 2.5118e+07 2.53784e+07
2 2.53878e+07 77199.8 2.53332e+07 2.54424e+07
3 1.67941e+07 1.22777e+06 1.55103e+07 1.79569e+07
3 1.69299e+07 2.88852e+06 1.37821e+07 1.94588e+07
3 1.69746e+07 1.11383e+06 1.58615e+07 1.80891e+07
3 1.69803e+07 2.72596e+06 1.38945e+07 1.90608e+07
3 1.70611e+07 2.96959e+06 1.43008e+07 2.02031e+07
4 1.06278e+07 54970.5 1.05784e+07 1.06976e+07
4 1.06599e+07 14874.9 1.06389e+07 1.06733e+07
4 1.06778e+07 75440.3 1.05898e+07 1.07472e+07
4 1.07032e+07 309841 1.03648e+07 1.09812e+07
4 1.07155e+07 52456.8 1.06669e+07 1.0786e+07
6 6.23738e+06 52035 6.17275e+06 6.31511e+06
6 6.9085e+06 57309.9 6.833e+06 6.98744e+06
6 6.91585e+06 57435.3 6.80421e+06 6.95838e+06
6 6.92278e+06 41643.7 6.85664e+06 6.98626e+06
6 6.95214e+06 48384.9 6.90512e+06 7.01585e+06
8 5.12103e+06 18367.5 5.09235e+06 5.13965e+06
8 5.16199e+06 23157.4 5.13432e+06 5.20321e+06
8 5.16509e+06 42808 5.08073e+06 5.21662e+06
8 5.16921e+06 22753.1 5.11841e+06 5.18908e+06
8 5.17771e+06 29190.6 5.1326e+06 5.21164e+06
12 3.37277e+06 13336 3.35545e+06 3.39847e+06
12 3.37308e+06 17031.5 3.34394e+06 3.40918e+06
12 3.38457e+06 17371 3.34034e+06 3.40827e+06
12 3.38833e+06 11821.9 3.37193e+06 3.40708e+06
12 3.3892e+06 15952.4 3.36335e+06 3.42432e+06
16 2.49975e+06 8617.97 2.48509e+06 2.51503e+06
16 2.50347e+06 10522.4 2.48523e+06 2.51911e+06
16 2.50555e+06 11512 2.4835e+06 2.53092e+06
16 2.51588e+06 10504.7 2.49727e+06 2.53253e+06
16 2.51896e+06 13934.7 2.48171e+06 2.54308e+06
32 1.21254e+06 7675.72 1.19799e+06 1.22922e+06
32 1.21318e+06 7300.1 1.19331e+06 1.22702e+06
32 1.21326e+06 9136.62 1.19216e+06 1.2298e+06
32 1.21697e+06 9822.09 1.20194e+06 1.24216e+06
32 1.22125e+06 11461.6 1.20222e+06 1.24949e+06
64 502261 4459.83 489331 512667
64 587878 5187.48 569105 601050
64 588159 5264.49 578599 602014
64 588483 7979.81 572640 609086
64 589517 6514.76 577731 609222
128 250213 4264.7 242022 264559
128 287071 7211.82 272031 311657
128 288293 4510.29 278077 302281
128 288667 5755.68 276221 306737
128 288736 6430.4 272257 301830
256 117836 3759.99 109709 130528
256 118531 5745.32 104642 138991
256 136805 6744.08 119832 156503
256 137325 5070.56 124335 150717
256 139372 4256.53 127418 152075
[-- Attachment #1.3: test-16b-no-shortcut-per-thread.csv --]
[-- Type: text/csv, Size: 2595 bytes --]
1 1.02845e+08 -nan 1.02845e+08 1.02845e+08
1 1.04206e+08 -nan 1.04206e+08 1.04206e+08
1 1.04621e+08 -nan 1.04621e+08 1.04621e+08
1 1.04701e+08 -nan 1.04701e+08 1.04701e+08
1 1.04739e+08 -nan 1.04739e+08 1.04739e+08
2 2.64241e+07 509873 2.60636e+07 2.67846e+07
2 2.64415e+07 1.77524e+06 2.51862e+07 2.76968e+07
2 2.65167e+07 10340 2.65094e+07 2.6524e+07
2 2.66158e+07 578717 2.62066e+07 2.7025e+07
2 2.67702e+07 118935 2.66861e+07 2.68543e+07
3 1.82369e+07 1.50911e+06 1.7102e+07 1.99496e+07
3 1.834e+07 2.10534e+06 1.65505e+07 2.06598e+07
3 1.83618e+07 4.53535e+06 1.57391e+07 2.35987e+07
3 1.85352e+07 1.00992e+06 1.74676e+07 1.94754e+07
3 1.91305e+07 3.02333e+06 1.71631e+07 2.26117e+07
4 1.23284e+07 46307 1.22609e+07 1.23644e+07
4 1.23401e+07 46034.7 1.22807e+07 1.23903e+07
4 1.24542e+07 206137 1.21485e+07 1.26e+07
4 1.25545e+07 1.26681e+06 1.09299e+07 1.40236e+07
4 1.29099e+07 49191.2 1.28699e+07 1.2972e+07
6 8.23012e+06 280831 7.89525e+06 8.63334e+06
6 8.2348e+06 483166 7.52975e+06 8.71746e+06
6 8.26691e+06 354248 7.85887e+06 8.8674e+06
6 8.3743e+06 558499 7.79402e+06 9.30234e+06
6 8.73745e+06 131690 8.53941e+06 8.88783e+06
8 6.17623e+06 69216.1 6.12149e+06 6.337e+06
8 6.2033e+06 183890 5.9529e+06 6.52034e+06
8 6.2055e+06 103515 6.03603e+06 6.35771e+06
8 6.25969e+06 152803 6.04837e+06 6.4844e+06
8 6.26351e+06 72339.2 6.16112e+06 6.38241e+06
12 4.11779e+06 98062.7 3.94202e+06 4.24957e+06
12 4.11928e+06 64768.5 4.00667e+06 4.2339e+06
12 4.14352e+06 65757.5 4.03701e+06 4.24415e+06
12 4.16054e+06 65751.6 4.04917e+06 4.27825e+06
12 4.17707e+06 70459.7 4.02992e+06 4.30936e+06
16 3.09386e+06 53225.5 2.98488e+06 3.18386e+06
16 3.09917e+06 54215.3 2.99338e+06 3.20062e+06
16 3.10259e+06 49156.8 3.0009e+06 3.2256e+06
16 3.10438e+06 48870.9 2.97897e+06 3.18441e+06
16 3.10848e+06 50904.3 2.94679e+06 3.17448e+06
32 1.55153e+06 37393 1.4648e+06 1.6139e+06
32 1.55263e+06 30617.3 1.48344e+06 1.61041e+06
32 1.55593e+06 30061.1 1.48991e+06 1.61142e+06
32 1.56356e+06 36687 1.51453e+06 1.64781e+06
32 1.56359e+06 36611.3 1.48372e+06 1.65024e+06
64 767720 23753.9 708521 827938
64 771237 29540.5 701362 863089
64 772007 30036.8 717081 849769
64 775743 27127.9 708516 839689
64 776068 23019 737473 840368
128 380824 21555 328758 435084
128 384902 21596.1 312762 424542
128 386031 24501.2 328767 481418
128 386886 21046 338578 443888
128 387581 23771 345513 466983
256 190511 15216.7 153045 226921
256 190965 15672.1 140328 244286
256 191556 16978 136718 237745
256 192105 16504.2 155514 239763
256 192403 15898.5 147119 239026
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
next prev parent reply other threads:[~2015-08-30 15:31 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-08-29 19:14 Jens Gustedt
2015-08-29 19:16 ` Jens Gustedt
2015-08-30 0:57 ` Rich Felker
2015-08-30 15:31 ` Jens Gustedt [this message]
2015-08-30 17:18 ` Rich Felker
2015-08-30 21:38 ` Jens Gustedt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1440948698.693.21.camel@inria.fr \
--to=jens.gustedt@inria.fr \
--cc=musl@lists.openwall.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).