From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <3d8d9b061e102356bda61520e1682072@ivey>
References: <1203fd06e0a13df9c23d4a5e1ac1aad0@ladd.quanstro.net> <3d8d9b061e102356bda61520e1682072@ivey>
Date: Tue, 20 May 2014 15:14:32 -0400
Message-ID:
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] waitfree
Topicbox-Message-UUID: ea51a326-ead8-11e9-9d60-3106f5b1d025

2014-05-20 11:41 GMT-04:00 erik quanstrom :
>> Dunno what to say. I'm not trying this on Plan 9, and I can't
>> reproduce your results on an i7 or an e5-2690. I'm certainly not
>> claiming that all pipelines, processors, and caches are equal, but
>> I've simply never seen this behavior. I also can't think of an
>> application in which one would want to execute a million consecutive
>> LOCK-prefixed instructions. Perhaps I just lack imagination.
>
> the original question was, are locked instructions wait free. the
> purpose was to see if there could be more than a few percent
> variation, and it appears there can be.

Your results suggest so, but I'm not able to reproduce them, and I haven't seen behavior like this in the applications I work on. I also just tried on an X3450, with no luck. I think this is a problem with the benchmark, not suggestive of the behavior of the processor.

Did you use a semaphore to ensure the threads all start execution at roughly the same time? If not, I can explain away the result: the process on CPU 4 gets to start with no contention on the memory bus; it is always able to get the value from cache because nothing has modified that address since it broadcast its modification. Cores 5, 6, and 7 start with progressively more contention on main memory, and consequently take longer to finish due to continually loading from main memory for a larger percentage of the 1MM reads they do.
If that's the case, it's not a fair benchmark.

> i think in modern intel microarches this question boils down to,
> can a given cpu thread in an arbitrary topology transition an
> arbitrary cacheline from an arbitrary state to state E in a bounded
> number of QPI cycles.

From [1], emphasis mine:

	A cacheline can be put in an Exclusive state (E) in response
	to a "read for ownership" (RFO) in order to store a value.
	**All instructions containing a lock prefix will result in a
	(RFO) since they always result in a write to the cache line.**

The same document goes on to describe how the read is performed, and what the behavior is when the line fill buffers and load/store buffers are full. In those cases, there is total ordering: the core is no longer permitted to "issue μops to the RS and OOO engine. This is the same mechanism as used in Core™2 processors to maintain pipeline consistency."

Knowing that the LOCK prefix makes the read μop do an RFO is useful because it lets us test the actual behavior. Your question can be restated: is RFO implemented as an exchange retry loop? I can't think of any reason it should be implemented that way: as long as the cache protocol has a total order (which it must, given that the μops generating the cache coherency traffic have a total order), a state transition from any state X to E can be done in a bounded number of cycles.

I'll assume for the sake of argument that it isn't bounded and is in fact implemented as a retry loop. If the read μop from the decomposed instruction looped trying to get the state to E, it would be implemented as something like:

	cacheline_t
	RFO(ptr_t addr)
	{
		cache_state_t st = get_cache_state(addr);

		while (xchg(st, E) != st) {
			st = get_cache_state(addr);
		}

		return read(addr);
	}

The read function will try to find a value for addr in cache, then fall back to memory.
If the LOCK-prefixed instruction's decomposed read μop behaves this way, an RFO miss can and will happen multiple times, stalling the pipeline for multiple memory lookups. You could detect this with pipeline stall performance counters, which would be measurably (with statistical significance) higher on the starved threads. Otherwise, the pipeline stall counter should closely match the RFO miss and cache miss counters.

> of course this reformulation isn't particularly helpful, at least to me,
> given the vagueness in the descriptions i've seen.
>
> this is practically important because if there is a dogpile lock or
> bit of shared memory in the system, then a particular cpu may
> end up waiting quite a long time to acquire the lock. it may
> even get starved out for a bit, perhaps with a subset of other cpus
> "batting around" more than once.

Even if your benchmark is valid, it still required you to execute 1MM LOCK-prefixed instructions back to back without branching. That will never happen with spinlocks or with wait- and lock-free data structures. Most of the instructions in a fair lock (ticket, MCS, Anderson, etc.) are not locking; most of the instructions in a test-and-test-and-set mutex with exponential backoff are not locking; and naive test-and-set mutexes are expected to have the sort of unbalanced acquisition behavior you describe, especially under contention and in pre-emption scenarios. None of this says anything about the wait-freedom of a LOCK CMPXCHG instruction.

Some performance graphs (acquisitions per second against contending cores) and cache state transition diagrams for various spinlock implementations are available in a presentation[2] I gave with a colleague (the maintainer of ConcurrencyKit[3]) at JHU a few years ago.

I also doubt this would manifest in wait-free or lock-free data structures like queues or ring buffers, where the majority of the instructions are not LOCK-prefixed.
That's especially true since the code around the insert/remove is unlikely to be locked and is likely to be fenced. Additionally, the LOCK-prefixed instructions in a wait-free data structure have no retry loops, and the LOCK-prefixed instructions in lock-free data structures branch precisely because they are implemented as retry loops.

For ainc() specifically, unless it was inlined (which ISTR the Plan 9 C compilers don't do, but you'd know that way better than I would), I can't imagine it screwing things up. The MOVs can't be LOCK-prefixed anyway (nor do they touch memory), and they give other processors time to do cache coherency traffic.

> this would be hidden waiting behavior that might lead to surprising
> (lack of) performance.
>
> i would say this vaguery could impede worst-case analysis for safety
> critical systems, but you'd be pretty adventuresome, to say the least,
> to use a processor with SMM in such a system.

Haha, indeed.

> - erik

--dho

[1] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[2] https://9vx.org/~dho/presentations/Spinlocks.pdf
[3] http://concurrencykit.org