9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] waitfree
@ 2014-05-19 17:34 erik quanstrom
  2014-05-19 19:49 ` Devon H. O'Dell
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2014-05-19 17:34 UTC (permalink / raw)
  To: 9fans

i've been thinking about ainc() and for the amd64 implementation,

	TEXT ainc(SB), 1, $-4			/* int ainc(int*) */
		MOVL	$1, AX
		LOCK; XADDL AX, (RARG)
		ADDL	$1, AX
		RET
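
(in rough C terms -- a sketch assuming GCC-style __atomic builtins rather
than the plan 9 runtime -- this is equivalent to:)

	/* atomically add 1 to *p and return the new value */
	int
	ainc(int *p)
	{
		return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST) + 1;
	}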

does anyone know if the architecture says this is wait-free?
this boils down to exactly how LOCK works, and i can't find
a good-enough definition.  i tend to think that it might not be.

- erik




* Re: [9fans] waitfree
  2014-05-19 17:34 [9fans] waitfree erik quanstrom
@ 2014-05-19 19:49 ` Devon H. O'Dell
  2014-05-19 20:21   ` erik quanstrom
  0 siblings, 1 reply; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-19 19:49 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
and purposes, this is wait-free. The LOCK prefix forces N processors
to synchronize on bus and cache operations and this is how there is a
guarantee of an atomic read or an atomic write. For instructions like
cmpxchg and xadd where reads and writes are implied, the bus is locked
for the duration of the instruction.

Wait-freedom provides a guarantee on an upper-bound for completion.
The pipeline will not be reordered around a LOCK instruction, so your
instruction will cause a pipeline stall. But you are guaranteed to be
bounded on the time to complete the instruction, and the instruction
decomposed into μops will not be preempted as the μops are executed.

Wait-freedom is defined by every operation having a bound on the
number of steps required before the operation completes. In this case,
you are bound by the number of μops of XADDL + latency to memory. This
is a finite number, so this is wait-freedom.

Lock-freedom guarantees per-thread progress, but does allow threads to
starve. LOCK XADDL will always complete in bounded time, but a while
(cmpxchg(&ptr, oldptr, newptr)) loop does not. However, you are
guaranteed that at least one thread will always make progress through
this loop: preemption of a thread does not cause other threads to
starve. (This is how lock-freedom differs from mutex-based critical
sections although they sound like they make the same guarantee.)
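
To make the contrast concrete, here is a minimal sketch of the CAS-style
retry loop (hypothetical; it assumes GCC's __atomic builtins rather than
anything in the Plan 9 tree). The fetch-and-add variant is wait-free
because it is a single LOCK XADD; the loop below is only lock-free, since
a given thread can keep losing the race, but every failed CAS means some
other thread's CAS succeeded:

	/* lock-free increment: the LOCK CMPXCHG may fail and be retried
	 * without bound for this thread, but each failure implies another
	 * thread's CMPXCHG succeeded, so system-wide progress is made */
	int
	ainc_lockfree(int *p)
	{
		int o, n;

		do {
			o = __atomic_load_n(p, __ATOMIC_RELAXED);
			n = o + 1;
		} while (!__atomic_compare_exchange_n(p, &o, n, 0,
		    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
		return n;
	}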

Interestingly, there is also a recent academic paper[1] that shows how
most lock-free algorithms are practically wait-free. That's a bit of a
tangent, because LOCK XADDL is defined to be wait-free anyway, but it
is good to know.

--dho

[1]: http://arxiv.org/abs/1311.3200

2014-05-19 13:34 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> i've been thinking about ainc() and for the amd64 implementation,
>
>         TEXT ainc(SB), 1, $-4                   /* int ainc(int*) */
>                 MOVL    $1, AX
>                 LOCK; XADDL AX, (RARG)
>                 ADDL    $1, AX
>                 RET
>
> does anyone know if the architecture says this is wait-free?
> this boils down to exactly how LOCK works, and i can't find
> a good-enough definition.  i tend to think that it might not be.
>
> - erik
>




* Re: [9fans] waitfree
  2014-05-19 19:49 ` Devon H. O'Dell
@ 2014-05-19 20:21   ` erik quanstrom
  2014-05-19 21:01     ` Devon H. O'Dell
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2014-05-19 20:21 UTC (permalink / raw)
  To: 9fans

On Mon May 19 15:51:27 EDT 2014, devon.odell@gmail.com wrote:
> The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
> and purposes, this is wait-free. The LOCK prefix forces N processors
> to synchronize on bus and cache operations and this is how there is a
> guarantee of an atomic read or an atomic write. For instructions like
> cmpxchg and xadd where reads and writes are implied, the bus is locked
> for the duration of the instruction.
> 
> Wait-freedom provides a guarantee on an upper-bound for completion.
> The pipeline will not be reordered around a LOCK instruction, so your
> instruction will cause a pipeline stall. But you are guaranteed to be
> bounded on the time to complete the instruction, and the instruction
> decomposed into μops will not be preempted as the μops are executed.

there is no bus.

what LOCK really does is invoke part of the MESI protocol.  the state
diagrams i've seen do not specify how this is arbitrated if more than one
processor is trying to gain exclusive access to the same cacheline.

> Wait-freedom is defined by every operation having a bound on the
> number of steps required before the operation completes. In this case,
> you are bound by the number of μops of XADDL + latency to memory. This
> is a finite number, so this is wait-freedom.

i'm worried about the bound on the number of MESI rounds.  i don't see
where the memory coherency protocol states that if there are n processors
a cacheline will be acquired in at most f(n) rounds.

- erik




* Re: [9fans] waitfree
  2014-05-19 20:21   ` erik quanstrom
@ 2014-05-19 21:01     ` Devon H. O'Dell
  2014-05-19 22:05       ` erik quanstrom
  0 siblings, 1 reply; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-19 21:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

So you seem to be worried that N processors in a tight loop of LOCK
XADD could starve a single processor. This isn't a problem because
locked instructions have a total order. Section 8.2.3.8:

"The memory-ordering model ensures that all processors agree on a
single execution order of all locked instructions, including those
that are larger than 8 bytes or are not naturally aligned."



2014-05-19 16:21 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
>
> On Mon May 19 15:51:27 EDT 2014, devon.odell@gmail.com wrote:
> > The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
> > and purposes, this is wait-free. The LOCK prefix forces N processors
> > to synchronize on bus and cache operations and this is how there is a
> > guarantee of an atomic read or an atomic write. For instructions like
> > cmpxchg and xadd where reads and writes are implied, the bus is locked
> > for the duration of the instruction.
> >
> > Wait-freedom provides a guarantee on an upper-bound for completion.
> > The pipeline will not be reordered around a LOCK instruction, so your
> > instruction will cause a pipeline stall. But you are guaranteed to be
> > bounded on the time to complete the instruction, and the instruction
> > decomposed into μops will not be preempted as the μops are executed.
>
> there is no bus.

What you mean is that the processor might not LOCK#, but it can. LOCK
will signal LOCK# on the bus if it thinks it has to. (And if whatever
is being ainc()'ed is ainc()'ed frequently, or whatever is being
ainc()'ed is just never in cache for whatever reason, that will
happen.)

> what LOCK really does is invoke part of the MESI protocol.  the state
> diagrams i've seen do not specify how this is arbitrated if more than one
> processor is trying to gain exclusive access to the same cacheline.
>
> > Wait-freedom is defined by every operation having a bound on the
> > number of steps required before the operation completes. In this case,
> > you are bound by the number of μops of XADDL + latency to memory. This
> > is a finite number, so this is wait-freedom.
>
> i'm worried about the bound on the number of MESI rounds.  i don't see
> where the memory coherency protocol states that if there are n processors
> a cacheline will be acquired in at most f(n) rounds.

In Section 8.2.3.8 of the manual:

"8.2.3.8 Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a
single execution order of all locked instructions, including those
that are larger than 8 bytes or are not naturally aligned."

Thus, LOCK-prefixed instructions will never cause processor
starvation, which seems to be your worry. They're wait-free.

--dho

> - erik
>




* Re: [9fans] waitfree
  2014-05-19 21:01     ` Devon H. O'Dell
@ 2014-05-19 22:05       ` erik quanstrom
  2014-05-19 22:14         ` ron minnich
  2014-05-20  1:10         ` Devon H. O'Dell
  0 siblings, 2 replies; 14+ messages in thread
From: erik quanstrom @ 2014-05-19 22:05 UTC (permalink / raw)
  To: 9fans

On Mon May 19 17:02:57 EDT 2014, devon.odell@gmail.com wrote:
> So you seem to be worried that N processors in a tight loop of LOCK
> XADD could starve a single processor. This isn't a problem because
> locked instructions have a total order. Section 8.2.3.8:
>
> "The memory-ordering model ensures that all processors agree on a
> single execution order of all locked instructions, including those
> that are larger than 8 bytes or are not naturally aligned."

i don't think this solves any problems.  given threads 0-n all executing
LOCK instructions, here's a valid ordering:

0	1	2		n
lock	stall	stall	...	stall
lock	stall	stall	...	stall
...			...
lock	stall	stall	...	stall

i'm not sure if the LOCK really changes the situation.  any old exclusive
cacheline access should do?

the documentation appears not to cover this completely.

- erik




* Re: [9fans] waitfree
  2014-05-19 22:05       ` erik quanstrom
@ 2014-05-19 22:14         ` ron minnich
  2014-05-19 22:18           ` erik quanstrom
  2014-05-20  1:10         ` Devon H. O'Dell
  1 sibling, 1 reply; 14+ messages in thread
From: ron minnich @ 2014-05-19 22:14 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, May 19, 2014 at 3:05 PM, erik quanstrom <quanstro@quanstro.net> wrote:

> the documentation appears not to cover this completely.


Hmm. You put documentation and completely in the same sentence. Agents
are converging on your house. Run!

There's always an undocumented progress engine somewhere in the works.

ron




* Re: [9fans] waitfree
  2014-05-19 22:14         ` ron minnich
@ 2014-05-19 22:18           ` erik quanstrom
  0 siblings, 0 replies; 14+ messages in thread
From: erik quanstrom @ 2014-05-19 22:18 UTC (permalink / raw)
  To: 9fans

On Mon May 19 18:15:21 EDT 2014, rminnich@gmail.com wrote:
> On Mon, May 19, 2014 at 3:05 PM, erik quanstrom <quanstro@quanstro.net> wrote:
>
> > the documentation appears not to cover this completely.
>
>
> Hmm. You put documentation and completely in the same sentence. Agents
> are converging on your house. Run!
>
> There's always an undocumented progress engine somewhere in the works.

true.  the internet does appear to run on the infinite undocumentation drive.
(sorry mr adams.)

- erik




* Re: [9fans] waitfree
  2014-05-19 22:05       ` erik quanstrom
  2014-05-19 22:14         ` ron minnich
@ 2014-05-20  1:10         ` Devon H. O'Dell
  2014-05-20  2:12           ` erik quanstrom
  1 sibling, 1 reply; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-20  1:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2014-05-19 18:05 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> On Mon May 19 17:02:57 EDT 2014, devon.odell@gmail.com wrote:
>> So you seem to be worried that N processors in a tight loop of LOCK
>> XADD could starve a single processor. This isn't a problem because
>> locked instructions have a total order. Section 8.2.3.8:
>>
>> "The memory-ordering model ensures that all processors agree on a
>> single execution order of all locked instructions, including those
>> that are larger than 8 bytes or are not naturally aligned."
>
> i don't think this solves any problems.  given threads 0-n all executing
> LOCK instructions, here's a valid ordering:
>
> 0       1       2               n
> lock    stall   stall   ...     stall
> lock    stall   stall   ...     stall
> ...                     ...
> lock    stall   stall   ...     stall
>
> i'm not sure if the LOCK really changes the situation.  any old exclusive
> cacheline access should do?

It is an ordering, but I don't think it's a valid one: your ellipses
suggest an unbounded execution time (given the context of the
discussion). I don't think that's valid because the protocol can't
possibly negotiate execution for more instructions than it has space
for in its pipeline. Furthermore, the pipeline cannot possibly be
filled with LOCK-prefixed instructions because it also needs to
schedule instruction loading, and it pipelines μops, not whole
instructions anyway. Furthermore, part of the execution cycle is
decomposing an instruction into its μop parts. At some point, that
processor is not going to be executing a LOCK instruction, it is going
to be executing some other μop (like decoding the next LOCK-prefixed
instruction it wants to execute). This won't be done with any
synchronization. When this happens, other processors will execute
their LOCK-prefixed instructions.

The only way I could think to try to force this execution history was
to unroll a loop of LOCK-prefixed instructions. In a tight loop, a
program I wrote to do LOCK XADD 10 billion times per-thread (across 4
threads on my 4 core system) finished with a standard deviation in
cycle count of around 1%. When I unroll the loop enough to fill the
pipeline, the stddev actually decreases (to about 0.5%), which leads
me to believe that the processor actively mitigates that sort of
instruction "attack" for highly concurrent workloads.

So either way, you're still bounded. Eventually p0 has to go do
something that isn't a LOCK-prefixed instruction, like decode the next
one. I don't know how to get the execution order you suggest. You'd
have to manage to fill the pipeline on the processor while starving
the pipeline on the others and preventing them from executing any
further instructions. Instruction load and decode stages are shared,
so I really don't see how you'd manage this without using PAUSE
strategically. You'd have to con the processor into executing that
order. At that point, just use a mutex :)

--dho

> the documentation appears not to cover this completely.
>
> - erik
>




* Re: [9fans] waitfree
  2014-05-20  1:10         ` Devon H. O'Dell
@ 2014-05-20  2:12           ` erik quanstrom
  2014-05-20 14:47             ` Devon H. O'Dell
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2014-05-20  2:12 UTC (permalink / raw)
  To: 9fans

> It is an ordering, but I don't think it's a valid one: your ellipses
> suggest an unbounded execution time (given the context of the
> discussion). I don't think that's valid because the protocol can't
> possibly negotiate execution for more instructions than it has space
> for in its pipeline. Furthermore, the pipeline cannot possibly be

yes, it is an unbounded set of instructions.  i am wondering if it isn't
possible for the same core to keep winning the MESI(F) arbitration.

i don't see µ-ops being tied to cachelines.  load/store buffers, i believe,
are where cachelines come into play.

> filled with LOCK-prefixed instructions because it also needs to
> schedule instruction loading, and it pipelines μops, not whole

i didn't read that in the arch guide.  where did you see this?

> instructions anyway. Furthermore, part of the execution cycle is
> decomposing an instruction into its μop parts. At some point, that
> processor is not going to be executing a LOCK instruction, it is going
> to be executing some other μop (like decoding the next LOCK-prefixed
> instruction it wants to execute). This won't be done with any
> synchronization. When this happens, other processors will execute
> their LOCK-prefixed instructions.

and this is an additional assumption that i was trying to avoid.  i'm
interested in whether LOCK XADD is wait-free in theory.

> further instructions. Instruction load and decode stages are shared,

this is not always true, and i think it hints at the issue: it might be
inaccurate to generalize from your cpu to all MESI cpus.

i get a 126% difference executing lock xadd 1024*1024 times
with no branches using cores 4-7 of a xeon e3-1230.  i'm sure it would
be quite a bit more impressive if it were a bit easier to turn the timer
interrupt off.

i really wish i had a four package system to play with right now.  that
could yield some really fun numbers.  :-)

- erik

example run.  output core/cycles.
; 6.lxac
4 152880511
7 288660939
6 320991900
5 338755451





* Re: [9fans] waitfree
  2014-05-20  2:12           ` erik quanstrom
@ 2014-05-20 14:47             ` Devon H. O'Dell
  2014-05-20 15:41               ` erik quanstrom
  0 siblings, 1 reply; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-20 14:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2014-05-19 22:12 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> i get a 126% difference executing lock xadd 1024*1024 times
> with no branches using cores 4-7 of a xeon e3-1230.  i'm sure it would
> be quite a bit more impressive if it were a bit easier to turn the timer
> interrupt off.

Dunno what to say. I'm not trying this on Plan 9, and I can't
reproduce your results on an i7 or an e5-2690. I'm certainly not
claiming that all pipelines, processors, and caches are equal, but
I've simply never seen this behavior. I also can't think of an
application in which one would want to execute a million consecutive
LOCK-prefixed instructions. Perhaps I just lack imagination.

--dho

> i really wish i had a four package system to play with right now.  that
> could yield some really fun numbers.  :-)
>
> - erik
>
> example run.  output core/cycles.
> ; 6.lxac
> 4 152880511
> 7 288660939
> 6 320991900
> 5 338755451
>
>




* Re: [9fans] waitfree
  2014-05-20 14:47             ` Devon H. O'Dell
@ 2014-05-20 15:41               ` erik quanstrom
  2014-05-20 19:14                 ` Devon H. O'Dell
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2014-05-20 15:41 UTC (permalink / raw)
  To: 9fans

> Dunno what to say. I'm not trying this on Plan 9, and I can't
> reproduce your results on an i7 or an e5-2690. I'm certainly not
> claiming that all pipelines, processors, and caches are equal, but
> I've simply never seen this behavior. I also can't think of an
> application in which one would want to execute a million consecutive
> LOCK-prefixed instructions. Perhaps I just lack imagination.

the original question was, are locked instructions wait free.  the
purpose was to see if there could be more than a few percent
variation, and it appears there can be.

i think in modern intel microarches this question boils down to,
can a given cpu thread in an arbitrary topology transition an
arbitrary cacheline from an arbitrary state to state E in a bounded
number of QPI cycles.

of course this reformulation isn't particularly helpful, at least to me,
given the vagueness in the descriptions i've seen.

this is practically important because if there is a dogpile lock or
bit of shared memory in the system, then a particular cpu may
end up waiting quite a long time to acquire the lock.  it may
even get starved out for a bit, perhaps with a subset of other cpus
"batting around" more than once.

this would be hidden waiting behavior that might lead to surprising
(lack of) performance.

i would say this vagueness could impede worst-case analysis for
safety-critical systems, but you'd be pretty adventuresome, to say the
least, to use a processor with SMM in such a system.

- erik




* Re: [9fans] waitfree
  2014-05-20 15:41               ` erik quanstrom
@ 2014-05-20 19:14                 ` Devon H. O'Dell
  2014-05-20 19:30                   ` erik quanstrom
  0 siblings, 1 reply; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-20 19:14 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2014-05-20 11:41 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
>> Dunno what to say. I'm not trying this on Plan 9, and I can't
>> reproduce your results on an i7 or an e5-2690. I'm certainly not
>> claiming that all pipelines, processors, and caches are equal, but
>> I've simply never seen this behavior. I also can't think of an
>> application in which one would want to execute a million consecutive
>> LOCK-prefixed instructions. Perhaps I just lack imagination.
>
> the original question was, are locked instructions wait free.  the
> purpose was to see if there could be more than a few percent
> variation, and it appears there can be.

Your results suggest so, but I'm not able to reproduce your results
and I have not seen behavior like this in the applications I work on.
Also just tried on an X3450 with no luck. I think this is a problem
with the benchmark and not suggestive of behavior of the processor.

Did you use a semaphore to ensure the threads all start execution at
roughly the same time? If not, I can explain away the result as the
process on CPU 4 gets to start with no contention on the memory bus;
it is always able to get the value from cache because nothing has
modified that address since it broadcast its modification. Cores 5, 6,
and 7 start with progressively more contention to main memory, and
consequently take longer to finish due to continually loading from main
memory for a larger percentage of the 1MM reads they do.

If that's the case, it's not a fair benchmark.

> i think in modern intel microarches this question boils down to,
> can a given cpu thread in an arbitrary topology transition an
> arbitrary cacheline from an arbitrary state to state E in a bounded
> number of QPI cycles.

From [1], emphasis mine:
    A cacheline can be put in an Exclusive state (E) in response to a “read for
    ownership” (RFO) in order to store a value. **All instructions containing a
    lock prefix will result in a (RFO) since they always result in a write to
    the cache line.**

The same document goes on to describe how the read is performed, and
what the behavior is when line fill buffers and load/store buffers are
filled. In these cases, there is total ordering: the core is no longer
permitted to "issue μops to the RS and OOO engine. This is the same
mechanism as used in Core™2 processors to maintain pipeline consistency."

But the fact that the LOCK prefix makes the read μop do an RFO is useful
to know, because it means we can test the actual behavior. Your question
can be restated:

Is RFO implemented as an exchange retry loop?

I can't think of any reason it should be implemented in that way: as
long as the cache protocol has a total order (which it must, given that
the μops that generate the cache coherency protocol traffic have a
total order), a state transition from X to E can be done in a bounded
number of cycles.

I'll assume for the sake of argument that it isn't bounded and is in
fact implemented as a retry loop. If the read μop from the decomposed
instruction would loop trying to get the state to E, it would be
implemented as something like:

cacheline_t RFO(ptr_t addr) {
  /* hypothetical sketch: RFO as an exchange retry loop */
  cache_state_t st = get_cache_state(addr);
  /* atomically swap the line's state to E; if the previous state is not
   * the one we observed, another agent raced us, so re-read and retry */
  while (xchg_cache_state(addr, E) != st) {
    st = get_cache_state(addr);
  }
  return read(addr);
}

The read function will try to find a value for addr in cache, then
from memory. If the LOCK-prefixed instruction's decomposed read μop
results in this behavior, a RFO miss can and will happen multiple
times. This will stall the pipeline for multiple memory lookups. You
can detect this with pipeline stall performance counters that will be
measurably (with significance) higher on the starved threads.
Otherwise, the pipeline stall counter should closely match the RFO
miss and cache miss counters.

> of course this reformulation isn't particularly helpful, at least to me,
> given the vagueness in the descriptions i've seen.
>
> this is practically important because if there is a dogpile lock or
> bit of shared memory in the system, then a particular cpu may
> end up waiting quite a long time to acquire the lock.  it may
> even get starved out for a bit, perhaps with a subset of other cpus
> "batting around" more than once.

If your benchmark is valid, it still required you to execute 1MM
lock-prefixed instructions in order without branching. This will never
happen with spinlocks or wait- and lock-free data structures. Most of
the instructions for a fair lock (ticket, MCS, Anderson, etc.) are not
locking; most of the instructions for a test-and-test-and-set mutex
with exponential backoff are not locking, and naive test-and-set
mutexes are expected to have pretty unbalanced acquisition behavior
like you describe, especially under contention and in pre-emption
scenarios. This doesn't say anything about the wait-freedom of a LOCK
CMPXCHG instruction.
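
As an illustration of "most of the instructions are not locking", here is
a minimal ticket-lock sketch (hypothetical, using GCC builtins; not the
ConcurrencyKit code): acquisition issues exactly one LOCK-prefixed
instruction, the fetch-and-add that takes a ticket, and the wait is plain
loads plus PAUSE:

	typedef struct {
		unsigned next;	/* next ticket to hand out */
		unsigned owner;	/* ticket currently being served */
	} Ticketlock;

	void
	ticketlock(Ticketlock *l)
	{
		/* the only locked instruction: LOCK XADD to take a ticket */
		unsigned t = __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQUIRE);

		/* spin with ordinary loads until it's our turn */
		while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != t)
			__builtin_ia32_pause();
	}

	void
	ticketunlock(Ticketlock *l)
	{
		/* plain store; no LOCK prefix needed to pass the lock on */
		__atomic_store_n(&l->owner, l->owner + 1, __ATOMIC_RELEASE);
	}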

Some performance graphs (acquisitions / second against contending
cores) and cache state transition diagrams for various spinlock
implementations are available in a presentation[2] I gave with a
colleague (the maintainer of ConcurrencyKit[3]) at JHU a few years
ago.

I also doubt this would manifest in wait-free or lock-free data
structures like queues or ring buffers, where a majority of the
instructions are not LOCK-prefixed, especially since the code around
the insert/remove is unlikely to be locked and is likely to be fenced.
Additionally, the LOCK-prefixed instructions in a wait-free data
structure have no retry loops, and the LOCK-prefixed instructions in
lock-free data structures branch precisely because they are
implemented as retry loops.

For ainc() specifically, unless it was inlined (which ISTR the Plan 9
C compilers don't do, but you'd know that way better than me), I can't
imagine that screwing things up. The MOV's can't be LOCK-prepended
anyway (nor do they deal with memory), and this gives other processors
time to do cache coherency traffic.

> this would be hidden waiting behavior that might lead to surprising
> (lack of) performance.
>
> i would say this vagueness could impede worst-case analysis for
> safety-critical systems, but you'd be pretty adventuresome, to say the
> least, to use a processor with SMM in such a system.

Haha, indeed.

> - erik

--dho

[1] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[2] https://9vx.org/~dho/presentations/Spinlocks.pdf
[3] http://concurrencykit.org




* Re: [9fans] waitfree
  2014-05-20 19:14                 ` Devon H. O'Dell
@ 2014-05-20 19:30                   ` erik quanstrom
  2014-05-20 20:32                     ` Devon H. O'Dell
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2014-05-20 19:30 UTC (permalink / raw)
  To: 9fans

> I can't think of any reason it should be implemented in that way: as
> long as the cache protocol has a total order (which it must, given that
> the μops that generate the cache coherency protocol traffic have a
> total order), a state transition from X to E can be done in a bounded
> number of cycles.

my understanding is that in this context this only means that different
processors see the same order.  it doesn't say anything about fairness.

> The read function will try to find a value for addr in cache, then
> from memory. If the LOCK-prefixed instruction's decomposed read μop
> results in this behavior, a RFO miss can and will happen multiple
> times. This will stall the pipeline for multiple memory lookups. You
> can detect this with pipeline stall performance counters that will be
> measurably (with significance) higher on the starved threads.
> Otherwise, the pipeline stall counter should closely match the RFO
> miss and cache miss counters.

yes.

> For ainc() specifically, unless it was inlined (which ISTR the Plan 9
> C compilers don't do, but you'd know that way better than me), I can't
> imagine that screwing things up. The MOV's can't be LOCK-prepended
> anyway (nor do they deal with memory), and this gives other processors
> time to do cache coherency traffic.

it doesn't matter if this is hard to do.  if it is possible under any circumstances,
with any protocol-adhering implementation, then the assertion that amd64
lock is wait-free is false.

- erik




* Re: [9fans] waitfree
  2014-05-20 19:30                   ` erik quanstrom
@ 2014-05-20 20:32                     ` Devon H. O'Dell
  0 siblings, 0 replies; 14+ messages in thread
From: Devon H. O'Dell @ 2014-05-20 20:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2014-05-20 15:30 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
>> I can't think of any reason it should be implemented in that way: as
>> long as the cache protocol has a total order (which it must, given that
>> the μops that generate the cache coherency protocol traffic have a
>> total order), a state transition from X to E can be done in a bounded
>> number of cycles.
>
> my understanding is that in this context this only means that different
> processors see the same order.  it doesn't say anything about fairness.

You're right, it doesn't. [1] explores fairness of the GQ on Nehalem
and Westmere and concludes:

"[...] if the GQ is contended, the Nehalem mi-
croarchitecture is unfair towards local cores (vs. the QPI), as
the performance degradation local cores experience is larger
than that of the QPI. Still, this behavior is reasonable as the
GQ does not allow remote cores to starve, and thus it avoids
further aggravating the penalty of remote memory accesses.
Nevertheless, this property of the Nehalem is undocumented
and can be discovered only with experimental evaluation."

It goes on to conclude that Westmere behaves in the same fashion.
Although unfair, the research shows the performance is bounded by the
available bandwidth of the GQ and QPI (and it shows these bounds can
favor remote access, which leaves us wondering about cache-locality
arguments under contention). But the fact that it shows bounds at all
is enough to show that the cache coherency protocol is wait-free.
Wait-freedom does not imply fairness.

See also [2] for more pretty graphs and some additional information if
you can view ppt slides.

>> For ainc() specifically, unless it was inlined (which ISTR the Plan 9
>> C compilers don't do, but you'd know that way better than me), I can't
>> imagine that screwing things up. The MOV's can't be LOCK-prepended
>> anyway (nor do they deal with memory), and this gives other processors
>> time to do cache coherency traffic.
>
> it doesn't matter if this is hard to do.  if it is possible under any circumstances,
> with any protocol-adhering implementation, then the assertion that amd64
> lock is wait-free is false.

I disagree if you are arguing that any general implementation is only
lock-free because the protocol can be wait-free. I will agree it is
possible for a specific implementation provided you can prove that
implementation can starve local or remote requests. But at least for
Westmere and Nehalem, [1] shows wait-freedom.

> - erik

--dho

[1] http://people.inf.ethz.ch/zmajo/publications/11-systor.pdf
[2] http://people.inf.ethz.ch/zmajo/public_talks/2011-05-31-systor_2011.ppt



