From: Gorka Guardiola
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Date: Thu, 19 Dec 2013 16:57:01 +0100
Subject: Re: [9fans] 9front pegs CPU on VMware



On Thu, Dec 19, 2013 at 4:19 PM, erik quanstrom <quanstro@quanstro.net> wrote:
> for those without much mwait experience, mwait is a kernel-only primitive
> (as per the instructions) that pauses the processor until a change has been
> made in some range of memory.  the size is determined by probing the h/w,
> but think cacheline.  so the discussion of locking is kernel specific as well.

The original discussion started about the runq spin lock, but I think the scope of the
problem is more general and the solution can be applied in both user space and kernel space.
While in user space you would do sleep(0), in the kernel you would sched(), or if you
are in the scheduler you would loop doing mwait (see my last email).
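
To be concrete, the loop I mean looks roughly like this. It is only an untested sketch
in gcc-flavoured C, not the actual kernel code; Runq and nready are made-up names for
whatever word the scheduler wants to watch:

#include <stdint.h>

/* made-up stand-in for scheduler state: a word that is written when work arrives */
typedef struct Runq Runq;
struct Runq {
	volatile uint32_t nready;
};

/* monitor/mwait pair; in practice kernel-only (CPL 0) on the machines I tried */
static inline void
monitor(const volatile void *addr)
{
	/* eax = address to watch; ecx/edx = extensions/hints, 0 for the plain variant */
	asm volatile("monitor" : : "a"(addr), "c"(0), "d"(0) : "memory");
}

static inline void
mwait(void)
{
	asm volatile("mwait" : : "a"(0), "c"(0) : "memory");
}

/* wait for the run queue to become non-empty without hammering the coherency fabric */
static void
idlewait(Runq *rq)
{
	while(rq->nready == 0){
		monitor(&rq->nready);	/* arm the monitor on that cacheline */
		if(rq->nready != 0)	/* re-check: the write may already have landed */
			break;
		mwait();		/* pause until the cacheline is written (or an interrupt) */
	}
}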

The manual I have available says:

"The MWAIT instruction can be executed at any privilege level. Th= e MONITOR CPUID feature flag (ECX[bit 3] when CPUID is executed with EAX = =3D 1) indicates the availability of the MONITOR and MWAIT instruction in a= processor. When set, the unconditional execution of MWAIT is supported at = privilege level 0 and conditional execution is supported at privilege level= s 1 through 3 (software should test for the appropriate support of these in= structions before unconditional use)."

There are also other extensions, which I have not tried.
I think the ideas can be used in the kernel or in user space, though I have only tried
it in the kernel and the implementation is only in the kernel right now.
> > > On 17 Dec 2013, at 12:00, cinap_lenrek@felloff.net wrote:
> > >
> > > thats a surprising result. by dog pile lock you mean the runq spinlock no?
> > >
> >
> > I guess it depends on the HW, but I don't find that so surprising. You are looping
> > sending messages to the coherency fabric, which gets congested as a result.
> > I have seen that happen.

> i assume you mean that there is contention on the cacheline holding the runq lock?
> i don't think there's classical congestion.  as i believe cachelines not involved in the
> mwait would experience no hold up.

I mean congestion in the classical network sense. There are switches and links to
exchange messages for the coherency protocol and some of them get congested.
What I was seeing is the counter of messages growing very, very fast and the performance
degrading, which I interpret as something getting congested.
I think when lock possession is ping-ponged around (not necessarily contended,
but with many changes in who is holding the lock, or maybe contention), many messages are
generated and then the problem occurs. I certainly saw the HW counters for messages go up
orders of magnitude when I was not using mwait.

> mwait() does improve things and one would expect the latency to always be better
> than spinning*.  but as it turns out the current scheduler is pretty hopeless in its locking
> anyway.  simply grabbing the lock with lock rather than canlock makes more sense to me.

These kinds of things are subtle. I spent a lot of time measuring, and it is difficult
to always know for sure what is happening; some of the results are counterintuitive
(at least to me) and depend on the concrete hardware/benchmark/test. So take my
conclusions with a pinch of salt :-).

I think the latency of mwait (this is what I remember
for the Opterons I was measuring; it is probably different on Intel and on other AMD models)
is actually worse (bigger) than with spinning,
but if you have enough processors doing the spinning (not necessarily on the same locks, but
generating traffic) then at some point it reverses because
of the traffic in the coherency fabric (or thermal effects: I do remember that
without mwait all the fans would be up, and with mwait they would turn off and the
machine would be noticeably cooler). Measure it on your hardware anyway:
a) it will probably behave differently with a better monitor/mwait, and
b) that way you can be sure it works for the loads you are interested in.


> also, using ticket locks (see 9atom nix kernel) will provide automatic backoff within the lock.
> ticket locks are a poor solution as they're not really scalable but they will scale to 24 cpus
> much better than tas locks.
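
For anyone following along, a ticket lock is just two counters. A rough sketch with
C11 atomics (not the 9atom code, and ignoring the interrupt and priority handling a
real kernel Lock needs):

#include <stdatomic.h>

typedef struct Ticketlock Ticketlock;
struct Ticketlock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void
ticketlock(Ticketlock *l)
{
	/* take a ticket; fetch_add orders the waiters, which is where the backoff/fairness comes from */
	unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

	/* spin with loads only, so the line is not bounced by failed writes as with tas */
	while(atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;	/* a pause or mwait hint could go here */
}

static void
ticketunlock(Ticketlock *l)
{
	/* pass the lock to the holder of the next ticket */
	unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);
	atomic_store_explicit(&l->owner, cur+1, memory_order_release);
}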

> mcs locks or some other queueing-style lock is clearly the long-term solution.  but as
> charles points out one would really prefer to figure out a way to fit them to the lock
> api.  i have some test code, but testing queueing locks in user space is ... interesting.
> i need a new approach.
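
And for completeness, the shape of an MCS lock, again only an untested C11 sketch.
The part that fights the current lock api shows up in the signatures: every waiter
has to bring its own queue node from somewhere (per-cpu or per-proc storage in a
kernel), which is exactly the interface change erik is talking about:

#include <stdatomic.h>
#include <stddef.h>

typedef struct MCSNode MCSNode;
typedef struct MCSLock MCSLock;

struct MCSNode {
	_Atomic(MCSNode*) next;
	atomic_int locked;
};

struct MCSLock {
	_Atomic(MCSNode*) tail;	/* last waiter in the queue, NULL when the lock is free */
};

static void
mcslock(MCSLock *l, MCSNode *me)
{
	MCSNode *prev;

	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->locked, 1, memory_order_relaxed);

	/* append ourselves to the queue; if it was empty we own the lock at once */
	prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
	if(prev == NULL)
		return;

	/* link in behind prev and spin only on our own node, a local cacheline */
	atomic_store_explicit(&prev->next, me, memory_order_release);
	while(atomic_load_explicit(&me->locked, memory_order_acquire))
		;	/* again, pause or mwait on &me->locked fits naturally here */
}

static void
mcsunlock(MCSLock *l, MCSNode *me)
{
	MCSNode *succ = atomic_load_explicit(&me->next, memory_order_acquire);

	if(succ == NULL){
		/* no visible successor: try to mark the lock free */
		MCSNode *expect = me;
		if(atomic_compare_exchange_strong_explicit(&l->tail, &expect, NULL,
		    memory_order_acq_rel, memory_order_acquire))
			return;
		/* a waiter is between the exchange and setting next; wait for the link */
		while((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
			;
	}
	/* hand over by clearing the successor's flag */
	atomic_store_explicit(&succ->locked, 0, memory_order_release);
}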


Let us know what your conclusions are after you implement and measure them :-).

G.
