From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <3d8d9b061e102356bda61520e1682072@ivey>
References: <1203fd06e0a13df9c23d4a5e1ac1aad0@ladd.quanstro.net> <3d8d9b061e102356bda61520e1682072@ivey>
Date: Tue, 20 May 2014 15:14:32 -0400
Message-ID:
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] waitfree
Topicbox-Message-UUID: ea51a326-ead8-11e9-9d60-3106f5b1d025

2014-05-20 11:41 GMT-04:00 erik quanstrom :
>> Dunno what to say. I'm not trying this on Plan 9, and I can't
>> reproduce your results on an i7 or an e5-2690. I'm certainly not
>> claiming that all pipelines, processors, and caches are equal, but
>> I've simply never seen this behavior. I also can't think of an
>> application in which one would want to execute a million consecutive
>> LOCK-prefixed instructions. Perhaps I just lack imagination.
>
> the original question was, are locked instructions wait free. the
> purpose was to see if there could be more than a few percent
> variation, and it appears there can be.

Your results suggest so, but I'm not able to reproduce them, and I haven't seen behavior like this in the applications I work on. I also just tried on an X3450, with no luck. I think this is a problem with the benchmark, not suggestive of the behavior of the processor.

Did you use a semaphore to ensure the threads all start execution at roughly the same time? If not, I can explain away the result: the process on CPU 4 gets to start with no contention on the memory bus; it is always able to get the value from cache because nothing has modified that address since it broadcast its modification. Cores 5, 6, and 7 start with progressively more contention on main memory, and consequently take longer to finish due to continually loading from main memory for a larger percentage of the 1MM reads they do.
If that's the case, it's not a fair benchmark.

> i think in modern intel microarches this question boils down to,
> can a given cpu thread in an arbitrary topology transition an
> arbitrary cacheline from an arbitrary state to state E in a bounded
> number of QPI cycles.

From [1], emphasis mine:

	A cacheline can be put in an Exclusive state (E) in response
	to a "read for ownership" (RFO) in order to store a value.
	**All instructions containing a lock prefix will result in a
	(RFO) since they always result in a write to the cache line.**

The same document goes on to describe how the read is performed, and what the behavior is when the line fill buffers and load/store buffers are full. In those cases, there is total ordering: the core is no longer permitted to "issue μops to the RS and OOO engine. This is the same mechanism as used in Core™2 processors to maintain pipeline consistency."

Knowing that the LOCK prefix makes the read μop do an RFO is useful because it lets us test the actual behavior. Your question can be restated: is RFO implemented as an exchange retry loop? I can't think of any reason it should be implemented that way: as long as the cache protocol has a total order (which it must, given that the μops generating the cache coherency traffic have a total order), a state transition from any state X to E can be done in a bounded number of cycles.

I'll assume for the sake of argument that it isn't bounded and is in fact implemented as a retry loop. If the read μop from the decomposed instruction looped trying to get the state to E, it would be implemented as something like:

	cacheline_t
	RFO(ptr_t addr)
	{
		cache_state_t st = get_cache_state(addr);

		while (xchg(st, E) != st) {
			st = get_cache_state(addr);
		}

		return read(addr);
	}

The read function will try to find a value for addr in cache, then fall back to memory.
If the LOCK-prefixed instruction's decomposed read μop behaves this way, an RFO miss can and will happen multiple times, stalling the pipeline for multiple memory lookups. You could detect this with pipeline stall performance counters, which would be measurably (with statistical significance) higher on the starved threads. Otherwise, the pipeline stall counter should closely match the RFO miss and cache miss counters.

> of course this reformulation isn't particularly helpful, at least to me,
> given the vagueness in the descriptions i've seen.
>
> this is practically important because if there is a dogpile lock or
> bit of shared memory in the system, then a particular cpu may
> end up waiting quite a long time to acquire the lock. it may
> even get starved out for a bit, perhaps with a subset of other cpus
> "batting around" more than once.

Even if your benchmark is valid, it still required you to execute 1MM LOCK-prefixed instructions back to back without branching. That will never happen with spinlocks or with wait- and lock-free data structures. Most of the instructions in a fair lock (ticket, MCS, Anderson, etc.) are not locking; most of the instructions in a test-and-test-and-set mutex with exponential backoff are not locking; and naive test-and-set mutexes are expected to have the sort of unbalanced acquisition behavior you describe, especially under contention and in pre-emption scenarios. None of this says anything about the wait-freedom of a LOCK CMPXCHG instruction.

Some performance graphs (acquisitions per second against contending cores) and cache state transition diagrams for various spinlock implementations are available in a presentation[2] I gave with a colleague (the maintainer of ConcurrencyKit[3]) at JHU a few years ago.

I also doubt this would manifest in wait-free or lock-free data structures like queues or ring buffers, where the majority of the instructions are not LOCK-prefixed.
That's especially true since the code around the insert/remove is unlikely to be locked and is likely to be fenced. Additionally, the LOCK-prefixed instructions in a wait-free data structure have no retry loops, and the LOCK-prefixed instructions in lock-free data structures branch precisely because they are implemented as retry loops.

For ainc() specifically, unless it was inlined (which ISTR the Plan 9 C compilers don't do, but you'd know that way better than I would), I can't imagine it screwing things up. The MOVs can't be LOCK-prefixed anyway (nor do they touch memory), and they give other processors time to do cache coherency traffic.

> this would be hidden waiting behavior that might lead to surprising
> (lack of) performance.
>
> i would say this vaguery could impede worst-case analysis for safety
> critical systems, but you'd be pretty adventuresome, to say the least,
> to use a processor with SMM in such a system.

Haha, indeed.

> - erik

--dho

[1] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[2] https://9vx.org/~dho/presentations/Spinlocks.pdf
[3] http://concurrencykit.org