* [9fans] waitfree

From: erik quanstrom @ 2014-05-19 17:34 UTC (permalink / raw)
To: 9fans

i've been thinking about ainc() and for the amd64 implementation,

	TEXT ainc(SB), 1, $-4			/* int ainc(int*) */
		MOVL	$1, AX
		LOCK; XADDL	AX, (RARG)
		ADDL	$1, AX
		RET

does anyone know if the architecture says this is wait-free?
this boils down to exactly how LOCK works, and i can't find
a good-enough definition.  i tend to think that it might not be.

- erik

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-19 19:49 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
and purposes, this is wait-free. The LOCK prefix forces N processors
to synchronize on bus and cache operations, and this is how an atomic
read or an atomic write is guaranteed. For instructions like cmpxchg
and xadd, where reads and writes are implied, the bus is locked for
the duration of the instruction.

Wait-freedom provides a guarantee of an upper bound on completion. The
pipeline will not be reordered around a LOCK instruction, so your
instruction will cause a pipeline stall. But you are guaranteed a
bound on the time to complete the instruction, and the instruction
decomposed into μops will not be preempted as the μops are executed.

Wait-freedom is defined by every operation having a bound on the
number of steps required before the operation completes. In this case,
you are bounded by the number of μops of XADDL plus the latency to
memory. This is a finite number, so this is wait-free.

Lock-freedom guarantees per-thread progress, but does allow threads to
starve. LOCK XADDL will always complete in bounded time, but a
while (cmpxchg(&ptr, oldptr, newptr)) loop does not. However, you are
guaranteed that at least one thread will always make progress through
this loop: preemption of a thread does not cause other threads to
starve. (This is how lock-freedom differs from mutex-based critical
sections, although they sound like they make the same guarantee.)

Interestingly, there is also a recent academic paper[1] that shows
that most lock-free algorithms are practically wait-free. That's a bit
of a tangent, because LOCK XADDL is defined to be wait-free anyway,
but it is good to know.
--dho

[1]: http://arxiv.org/abs/1311.3200

2014-05-19 13:34 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> i've been thinking about ainc() and for the amd64 implementation,
>
> 	TEXT ainc(SB), 1, $-4			/* int ainc(int*) */
> 		MOVL	$1, AX
> 		LOCK; XADDL	AX, (RARG)
> 		ADDL	$1, AX
> 		RET
>
> does anyone know if the architecture says this is wait-free?
> this boils down to exactly how LOCK works, and i can't find
> a good-enough definition.  i tend to think that it might not be.
>
> - erik
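The contrast drawn above, a single LOCK XADD versus a compare-and-swap retry loop, can be sketched in C11 atomics (illustrative code, not the Plan 9 kernel's primitives):

```c
#include <stdatomic.h>

/*
 * Wait-free: one fetch-and-add (LOCK XADD on amd64) completes in a
 * bounded number of steps no matter what other threads do.
 */
static int
inc_waitfree(atomic_int *p)
{
	return atomic_fetch_add(p, 1) + 1;
}

/*
 * Lock-free: at least one thread always succeeds, but this particular
 * thread can retry arbitrarily many times if it keeps losing the race.
 */
static int
inc_lockfree(atomic_int *p)
{
	int old = atomic_load(p);

	/* on failure, old is reloaded with the current value; retry */
	while (!atomic_compare_exchange_weak(p, &old, old + 1))
		;
	return old + 1;
}
```

Both produce the same result; only the progress guarantee differs.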
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-19 20:21 UTC (permalink / raw)
To: 9fans

On Mon May 19 15:51:27 EDT 2014, devon.odell@gmail.com wrote:
> The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
> and purposes, this is wait-free. The LOCK prefix forces N processors
> to synchronize on bus and cache operations and this is how there is a
> guarantee of an atomic read or an atomic write. For instructions like
> cmpxchg and xadd where reads and writes are implied, the bus is locked
> for the duration of the instruction.
>
> Wait-freedom provides a guarantee on an upper-bound for completion.
> The pipeline will not be reordered around a LOCK instruction, so your
> instruction will cause a pipeline stall. But you are guaranteed to be
> bounded on the time to complete the instruction, and the instruction
> decomposed into μops will not be preempted as the μops are executed.

there is no bus.  what LOCK really does is invoke part of the MESI
protocol.  the state diagrams i've seen do not specify how this is
arbitrated if there are > 1 processors trying to gain exclusive access
to the same cacheline.

> Wait-freedom is defined by every operation having a bound on the
> number of steps required before the operation completes. In this case,
> you are bound by the number of μops of XADDL + latency to memory. This
> is a finite number, so this is wait-freedom.

i'm worried about the bound on the number of MESI rounds.  i don't see
where the memory coherency protocol states that if there are n processors
a cacheline will be acquired in at most f(n) rounds.

- erik
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-19 21:01 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

So you seem to be worried that, with N processors in a tight loop of
LOCK XADD, a single processor could keep winning. This isn't a problem
because locked instructions have a total order. Section 8.2.3.8:

"The memory-ordering model ensures that all processors agree on a
single execution order of all locked instructions, including those
that are larger than 8 bytes or are not naturally aligned."

2014-05-19 16:21 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> On Mon May 19 15:51:27 EDT 2014, devon.odell@gmail.com wrote:
> > The LOCK prefix is effectively a tiny, tiny mutex, but for all intents
> > and purposes, this is wait-free. The LOCK prefix forces N processors
> > to synchronize on bus and cache operations and this is how there is a
> > guarantee of an atomic read or an atomic write. For instructions like
> > cmpxchg and xadd where reads and writes are implied, the bus is locked
> > for the duration of the instruction.
> >
> > Wait-freedom provides a guarantee on an upper-bound for completion.
> > The pipeline will not be reordered around a LOCK instruction, so your
> > instruction will cause a pipeline stall. But you are guaranteed to be
> > bounded on the time to complete the instruction, and the instruction
> > decomposed into μops will not be preempted as the μops are executed.
>
> there is no bus.

What you mean is that the processor might not assert LOCK#, but it
can. LOCK will signal LOCK# on the bus if it thinks it has to. (And if
whatever is being ainc()'ed is ainc()'ed frequently, or whatever is
being ainc()'ed is just never in cache for whatever reason, that will
happen.)

> what LOCK really does is invoke part of the MESI protocol.
> the state diagrams i've seen do not specify how this is arbitrated if
> there are > 1 processors trying to gain exclusive access to the same
> cacheline.
>
> > Wait-freedom is defined by every operation having a bound on the
> > number of steps required before the operation completes. In this case,
> > you are bound by the number of μops of XADDL + latency to memory. This
> > is a finite number, so this is wait-freedom.
>
> i'm worried about the bound on the number of MESI rounds.  i don't see
> where the memory coherency protocol states that if there are n processors
> a cacheline will be acquired in at most f(n) rounds.

In Section 8.2.3.8 of the manual:

"8.2.3.8 Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a
single execution order of all locked instructions, including those
that are larger than 8 bytes or are not naturally aligned."

Thus, LOCK-prefixed instructions will never cause processor
starvation, which seems to be your worry. They're wait-free.

--dho

> - erik
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-19 22:05 UTC (permalink / raw)
To: 9fans

On Mon May 19 17:02:57 EDT 2014, devon.odell@gmail.com wrote:
> So you seem to be worried that, with N processors in a tight loop of
> LOCK XADD, a single processor could keep winning. This isn't a problem
> because locked instructions have a total order. Section 8.2.3.8:
>
> "The memory-ordering model ensures that all processors agree on a
> single execution order of all locked instructions, including those
> that are larger than 8 bytes or are not naturally aligned."

i don't think this solves any problems.  given threads 0-n all executing
LOCK instructions, here's a valid ordering:

	0	1	2		n
	lock	stall	stall	...	stall
	lock	stall	stall	...	stall
	...				...
	lock	stall	stall	...	stall

i'm not sure if the LOCK really changes the situation.  any old exclusive
cacheline access should do?

the documentation appears not to cover this completely.

- erik
* Re: [9fans] waitfree

From: ron minnich @ 2014-05-19 22:14 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Mon, May 19, 2014 at 3:05 PM, erik quanstrom <quanstro@quanstro.net> wrote:
> the documentation appears not to cover this completely.

Hmm. You put documentation and completely in the same sentence. Agents
are converging on your house. Run!

There's always an undocumented progress engine somewhere in the works.

ron
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-19 22:18 UTC (permalink / raw)
To: 9fans

On Mon May 19 18:15:21 EDT 2014, rminnich@gmail.com wrote:
> On Mon, May 19, 2014 at 3:05 PM, erik quanstrom <quanstro@quanstro.net> wrote:
> > the documentation appears not to cover this completely.
>
> Hmm. You put documentation and completely in the same sentence. Agents
> are converging on your house. Run!
>
> There's always an undocumented progress engine somewhere in the works.

true.  the internet does appear to run on the infinite undocumentation
drive.  (sorry, mr adams.)

- erik
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-20 1:10 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

2014-05-19 18:05 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> i don't think this solves any problems.  given threads 0-n all executing
> LOCK instructions, here's a valid ordering:
>
> 	0	1	2		n
> 	lock	stall	stall	...	stall
> 	lock	stall	stall	...	stall
> 	...				...
> 	lock	stall	stall	...	stall
>
> i'm not sure if the LOCK really changes the situation.  any old exclusive
> cacheline access should do?

It is an ordering, but I don't think it's a valid one: your ellipses
suggest an unbounded execution time (given the context of the
discussion). I don't think that's valid because the protocol can't
possibly negotiate execution for more instructions than it has space
for in its pipeline. The pipeline cannot be filled with LOCK-prefixed
instructions, because the processor also needs to schedule instruction
loading, and it pipelines μops, not whole instructions, anyway.

Furthermore, part of the execution cycle is decomposing an instruction
into its μop parts. At some point, that processor is not going to be
executing a LOCK instruction; it is going to be executing some other
μop (like decoding the next LOCK-prefixed instruction it wants to
execute). This won't be done with any synchronization.
When this happens, other processors will execute their LOCK-prefixed
instructions.

The only way I could think to try to force this execution history was
to unroll a loop of LOCK-prefixed instructions. In a tight loop, a
program I wrote to do LOCK XADD 10 billion times per thread (across 4
threads on my 4-core system) finished with a standard deviation in
cycle count of around 1%. When I unroll the loop enough to fill the
pipeline, the stddev actually decreases (to about 0.5%), which leads
me to believe that the processor actively mitigates that sort of
instruction "attack" for highly concurrent workloads.

So either way, you're still bounded. Eventually p0 has to go do
something that isn't a LOCK-prefixed instruction, like decode the next
one. I don't know how to get the execution order you suggest. You'd
have to manage to fill the pipeline on one processor while starving
the pipeline on the others and preventing them from executing any
further instructions. Instruction load and decode stages are shared,
so I really don't see how you'd manage this without using PAUSE
strategically. You'd have to con the processor into executing that
order. At that point, just use a mutex :)

--dho

> the documentation appears not to cover this completely.
>
> - erik
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-20 2:12 UTC (permalink / raw)
To: 9fans

> It is an ordering, but I don't think it's a valid one: your ellipses
> suggest an unbounded execution time (given the context of the
> discussion). I don't think that's valid because the protocol can't
> possibly negotiate execution for more instructions than it has space
> for in its pipeline. Furthermore, the pipeline cannot possibly be

yes, it is an unbounded set of instructions.  i am wondering if it isn't
possible for the same core to keep winning the MESI(F) arbitration.
i don't see anything tying μops to cachelines; load/store buffers,
i believe, are where cachelines come into play.

> filled with LOCK-prefixed instructions because it also needs to
> schedule instruction loading, and it pipelines μops, not whole

i didn't read that in the arch guide.  where did you see this?

> instructions anyway. Furthermore, part of the execution cycle is
> decomposing an instruction into its μop parts. At some point, that
> processor is not going to be executing a LOCK instruction, it is going
> to be executing some other μop (like decoding the next LOCK-prefixed
> instruction it wants to execute). This won't be done with any
> synchronization. When this happens, other processors will execute
> their LOCK-prefixed instructions.

and this is an additional assumption that i was trying to avoid.  i'm
interested in whether LOCK XADD is wait-free in theory.

> further instructions. Instruction load and decode stages are shared,

this is not always true, and i think it hints at the issue: it might be
inaccurate to generalize from your cpu to all MESI cpus.

i get a 126% difference executing lock xadd 1024*1024 times
with no branches using cores 4-7 of a xeon e3-1230.
i'm sure it would be quite a bit more impressive if it were a bit easier
to turn the timer interrupt off.

i really wish i had a four-package system to play with right now.  that
could yield some really fun numbers.  :-)

- erik

example run.  output is core/cycles.

	; 6.lxac
	4	152880511
	7	288660939
	6	320991900
	5	338755451
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-20 14:47 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

2014-05-19 22:12 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> i get a 126% difference executing lock xadd 1024*1024 times
> with no branches using cores 4-7 of a xeon e3-1230.  i'm sure it would
> be quite a bit more impressive if it were a bit easier to turn the timer
> interrupt off.

Dunno what to say. I'm not trying this on Plan 9, and I can't
reproduce your results on an i7 or an e5-2690. I'm certainly not
claiming that all pipelines, processors, and caches are equal, but
I've simply never seen this behavior. I also can't think of an
application in which one would want to execute a million consecutive
LOCK-prefixed instructions. Perhaps I just lack imagination.

--dho

> i really wish i had a four-package system to play with right now.  that
> could yield some really fun numbers.  :-)
>
> - erik
>
> example run.  output is core/cycles.
>
> 	; 6.lxac
> 	4	152880511
> 	7	288660939
> 	6	320991900
> 	5	338755451
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-20 15:41 UTC (permalink / raw)
To: 9fans

> Dunno what to say. I'm not trying this on Plan 9, and I can't
> reproduce your results on an i7 or an e5-2690. I'm certainly not
> claiming that all pipelines, processors, and caches are equal, but
> I've simply never seen this behavior. I also can't think of an
> application in which one would want to execute a million consecutive
> LOCK-prefixed instructions. Perhaps I just lack imagination.

the original question was, are locked instructions wait-free.  the
purpose was to see if there could be more than a few percent
variation, and it appears there can be.

i think in modern intel microarches this question boils down to: can a
given cpu thread in an arbitrary topology transition an arbitrary
cacheline from an arbitrary state to state E in a bounded number of
QPI cycles?  of course this reformulation isn't particularly helpful,
at least to me, given the vagueness in the descriptions i've seen.

this is practically important because if there is a dogpile lock or
bit of shared memory in the system, then a particular cpu may end up
waiting quite a long time to acquire the lock.  it may even get
starved out for a bit, perhaps with a subset of other cpus "batting
around" more than once.  this would be hidden waiting behavior that
might lead to surprising (lack of) performance.

i would say this vagueness could impede worst-case analysis for
safety-critical systems, but you'd be pretty adventuresome, to say the
least, to use a processor with SMM in such a system.

- erik
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-20 19:14 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

2014-05-20 11:41 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> the original question was, are locked instructions wait-free.  the
> purpose was to see if there could be more than a few percent
> variation, and it appears there can be.

Your results suggest so, but I'm not able to reproduce your results,
and I have not seen behavior like this in the applications I work on.
Also just tried on an X3450 with no luck. I think this is a problem
with the benchmark and not suggestive of behavior of the processor.

Did you use a semaphore to ensure the threads all start execution at
roughly the same time? If not, I can explain away the result: the
process on CPU 4 gets to start with no contention on the memory bus;
it is always able to get the value from cache because nothing has
modified that address since it broadcast its modification. Cores 5, 6,
and 7 start with progressively more contention to main memory, and
consequently take longer to finish due to continually loading from
main memory for a larger percentage of the million reads they do. If
that's the case, it's not a fair benchmark.
> i think in modern intel microarches this question boils down to,
> can a given cpu thread in an arbitrary topology transition an
> arbitrary cacheline from an arbitrary state to state E in a bounded
> number of QPI cycles.

From [1], emphasis mine:

"A cacheline can be put in an Exclusive state (E) in response to a
'read for ownership' (RFO) in order to store a value. **All
instructions containing a lock prefix will result in a (RFO) since
they always result in a write to the cache line.**"

The same document continues on to mention how the read is performed,
and what the behavior is when line fill buffers and load/store buffers
are filled. In these cases, there is total ordering: the core is no
longer permitted to "issue μops to the RS and OOO engine. This is the
same mechanism as used in Core™2 processors to maintain pipeline
consistency." But that the LOCK prefix makes the read μop do an RFO is
useful to know, because we can test the actual behavior.

Your question can be restated: is RFO implemented as an exchange retry
loop? I can't think of any reason it should be implemented that way:
as long as the cache protocol has a total order (which it must, given
that the μops that generate the cache coherency protocol traffic have
a total order), a state transition from X to E can be done in a
bounded number of cycles.

I'll assume for the sake of argument that it isn't bounded and is in
fact implemented as a retry loop. If the read μop from the decomposed
instruction would loop trying to get the state to E, it would be
implemented as something like:

	cacheline_t
	RFO(ptr_t addr)
	{
		cache_state_t st = get_cache_state(addr);

		while (xchg(st, E) != st)
			st = get_cache_state(addr);
		return read(addr);
	}

The read function will try to find a value for addr in cache, then
from memory. If the LOCK-prefixed instruction's decomposed read μop
results in this behavior, an RFO miss can and will happen multiple
times. This will stall the pipeline for multiple memory lookups.
You can detect this with pipeline-stall performance counters, which
will be measurably (with significance) higher on the starved threads.
Otherwise, the pipeline-stall counter should closely match the RFO
miss and cache miss counters.

> of course this reformulation isn't particularly helpful, at least to me,
> given the vagueness in the descriptions i've seen.
>
> this is practically important because if there is a dogpile lock or
> bit of shared memory in the system, then a particular cpu may
> end up waiting quite a long time to acquire the lock.  it may
> even get starved out for a bit, perhaps with a subset of other cpus
> "batting around" more than once.

If your benchmark is valid, it still required you to execute a million
LOCK-prefixed instructions in order without branching. This will never
happen with spinlocks or wait- and lock-free data structures. Most of
the instructions for a fair lock (ticket, MCS, Anderson, etc.) are not
locking; most of the instructions for a test-and-test-and-set mutex
with exponential backoff are not locking; and naive test-and-set
mutexes are expected to have pretty unbalanced acquisition behavior
like you describe, especially under contention and in preemption
scenarios. This doesn't say anything about the wait-freedom of a LOCK
CMPXCHG instruction.

Some performance graphs (acquisitions/second against contending cores)
and cache state transition diagrams for various spinlock
implementations are available in a presentation[2] I gave with a
colleague (the maintainer of Concurrency Kit[3]) at JHU a few years
ago.

I also doubt this would manifest in wait-free or lock-free data
structures like queues or ring buffers, where a majority of the
instructions are not LOCK-prefixed. Especially since the code around
the insert/remove is unlikely to be locked and is likely to be fenced.
Additionally, the LOCK-prefixed instructions in a wait-free data
structure have no retry loops, and the LOCK-prefixed instructions in
lock-free data structures branch precisely because they are
implemented as retry loops.

For ainc() specifically, unless it was inlined (which ISTR the Plan 9
C compilers don't do, but you'd know that way better than me), I can't
imagine that screwing things up. The MOVs can't be LOCK-prefixed
anyway (nor do they deal with memory), and this gives other processors
time to do cache coherency traffic.

> this would be hidden waiting behavior that might lead to surprising
> (lack of) performance.
>
> i would say this vagueness could impede worst-case analysis for
> safety-critical systems, but you'd be pretty adventuresome, to say the
> least, to use a processor with SMM in such a system.

Haha, indeed.

> - erik

--dho

[1] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[2] https://9vx.org/~dho/presentations/Spinlocks.pdf
[3] http://concurrencykit.org
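Of the fair locks mentioned above, the ticket lock is the simplest to sketch. Its acquire path contains exactly one LOCK-prefixed instruction (the fetch-and-add that takes a ticket), which illustrates the point that most of a fair lock's instructions are not locking. This is a generic C11 sketch, not Concurrency Kit's implementation:

```c
#include <stdatomic.h>

struct ticketlock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void
ticket_lock(struct ticketlock *l)
{
	/* the only LOCK-prefixed instruction: LOCK XADD on amd64 */
	unsigned me = atomic_fetch_add(&l->next, 1);

	/* spin with plain loads; no further locked traffic */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void
ticket_unlock(struct ticketlock *l)
{
	/* only the holder writes owner, so a plain increment suffices */
	unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

FIFO ticket order is what makes the lock fair: each waiter is served in the order it incremented `next`, so no thread can be passed over indefinitely even though the spin itself is not wait-free.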
* Re: [9fans] waitfree

From: erik quanstrom @ 2014-05-20 19:30 UTC (permalink / raw)
To: 9fans

> I can't think of any reason it should be implemented in that way as
> long as the cache protocol has a total order (which it must given that
> the μops that generate the cache coherency protocol traffic have a
> total order), a state transition from X to E can be done in a bounded
> number of cycles.

my understanding is that in this context this only means that different
processors see the same order.  it doesn't say anything about fairness.

> The read function will try to find a value for addr in cache, then
> from memory. If the LOCK-prefixed instruction's decomposed read μop
> results in this behavior, an RFO miss can and will happen multiple
> times. This will stall the pipeline for multiple memory lookups. You
> can detect this with pipeline-stall performance counters, which will
> be measurably (with significance) higher on the starved threads.
> Otherwise, the pipeline-stall counter should closely match the RFO
> miss and cache miss counters.

yes.

> For ainc() specifically, unless it was inlined (which ISTR the Plan 9
> C compilers don't do, but you'd know that way better than me), I can't
> imagine that screwing things up. The MOVs can't be LOCK-prefixed
> anyway (nor do they deal with memory), and this gives other processors
> time to do cache coherency traffic.

it doesn't matter if this is hard to do.  if it is possible under any
circumstances, with any protocol-adhering implementation, then the
assertion that amd64 LOCK is wait-free is false.

- erik
* Re: [9fans] waitfree

From: Devon H. O'Dell @ 2014-05-20 20:32 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

2014-05-20 15:30 GMT-04:00 erik quanstrom <quanstro@quanstro.net>:
> > I can't think of any reason it should be implemented in that way as
> > long as the cache protocol has a total order (which it must given that
> > the μops that generate the cache coherency protocol traffic have a
> > total order), a state transition from X to E can be done in a bounded
> > number of cycles.
>
> my understanding is that in this context this only means that different
> processors see the same order.  it doesn't say anything about fairness.

You're right, it doesn't. [1] explores fairness of the GQ on Nehalem
and Westmere and concludes:

"[...] if the GQ is contended, the Nehalem microarchitecture is unfair
towards local cores (vs. the QPI), as the performance degradation
local cores experience is larger than that of the QPI. Still, this
behavior is reasonable as the GQ does not allow remote cores to
starve, and thus it avoids further aggravating the penalty of remote
memory accesses. Nevertheless, this property of the Nehalem is
undocumented and can be discovered only with experimental evaluation."

It continues to conclude that Westmere behaves in the same fashion.
However, although unfair, the research shows the performance is
bounded on the available bandwidth of the GQ and QPI. It shows these
bounds can favor remote access (which leads one to wonder about
cache-locality arguments under contention). However, that it shows
there are bounds is enough to show that the cache coherency protocol
is wait-free. Wait-freedom does not imply fairness.

See also [2] for more pretty graphs and some additional information,
if you can view ppt slides.
> > For ainc() specifically, unless it was inlined (which ISTR the Plan 9
> > C compilers don't do, but you'd know that way better than me), I can't
> > imagine that screwing things up. The MOVs can't be LOCK-prefixed
> > anyway (nor do they deal with memory), and this gives other processors
> > time to do cache coherency traffic.
>
> it doesn't matter if this is hard to do.  if it is possible under any
> circumstances, with any protocol-adhering implementation, then the
> assertion that amd64 LOCK is wait-free is false.

I disagree if you are arguing that any general implementation is only
lock-free because the protocol can be wait-free. I will agree it is
possible for a specific implementation, provided you can prove that
implementation can starve local or remote requests. But at least for
Westmere and Nehalem, [1] shows wait-freedom.

--dho

> - erik

[1] http://people.inf.ethz.ch/zmajo/publications/11-systor.pdf
[2] http://people.inf.ethz.ch/zmajo/public_talks/2011-05-31-systor_2011.ppt