From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: 
References: 
Date: Mon, 19 May 2014 21:10:21 -0400
Message-ID: 
From: "Devon H. O'Dell"
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] waitfree
Topicbox-Message-UUID: e9579a02-ead8-11e9-9d60-3106f5b1d025

2014-05-19 18:05 GMT-04:00 erik quanstrom:
> On Mon May 19 17:02:57 EDT 2014, devon.odell@gmail.com wrote:
>> So you seem to be worried that N processors in a tight loop of LOCK
>> XADD could have a single processor. This isn't a problem because
>> locked instructions have total order. Section 8.2.3.8:
>>
>> "The memory-ordering model ensures that all processors agree on a
>> single execution order of all locked instructions, including those
>> that are larger than 8 bytes or are not naturally aligned."
>
> i don't think this solves any problems. given thread 0-n all executing
> LOCK instructions, here's a valid ordering:
>
>       0       1       2               n
>       lock    stall   stall   ...     stall
>       lock    stall   stall   ...     stall
>       ...                             ...
>       lock    stall   stall   ...     stall
>
> i'm not sure if the LOCK really changes the situation. any old exclusive
> cacheline access should do?

It is an ordering, but I don't think it's a valid one: your ellipses
suggest an unbounded execution time (given the context of the
discussion). I don't think that's possible, because the coherence
protocol can't negotiate execution for more instructions than the
processor has space for in its pipeline.

Furthermore, the pipeline can't be filled entirely with LOCK-prefixed
instructions: the processor also has to schedule instruction fetch, and
it pipelines μops, not whole instructions, anyway. Part of the
execution cycle is decomposing an instruction into its constituent
μops. At some point that processor is not going to be executing a LOCK
instruction; it is going to be executing some other μop (like decoding
the next LOCK-prefixed instruction it wants to execute), and that won't
be done with any synchronization. When that happens, the other
processors will execute their LOCK-prefixed instructions.

The only way I could think of to try to force this execution history
was to unroll a loop of LOCK-prefixed instructions. In a tight loop, a
program I wrote that does LOCK XADD 10 billion times per thread (across
4 threads on my 4-core system) finished with a standard deviation in
cycle count of around 1% (a rough sketch of that kind of benchmark is
appended at the end of this message). When I unroll the loop enough to
fill the pipeline, the stddev actually decreases (to about 0.5%), which
leads me to believe that the processor actively mitigates that sort of
instruction "attack" for highly concurrent workloads.

So either way, you're still bounded. Eventually p0 has to go do
something that isn't a LOCK-prefixed instruction, like decoding the
next one. I don't know how to get the execution order you suggest:
you'd have to fill the pipeline on one processor while starving the
pipelines on the others and preventing them from executing any further
instructions. Instruction fetch and decode stages are shared, so I
really don't see how you'd manage this without using PAUSE
strategically. You'd have to con the processor into executing that
order, and at that point, just use a mutex :)

--dho

> the documentation appears not to cover this completely.
>
> - erik
>
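
Here is a minimal sketch of the kind of per-thread LOCK XADD benchmark
described above. It is illustrative only, not the exact program from
the thread: the thread count, the (scaled-down) iteration count, and
the use of pthreads, RDTSC, and __atomic_fetch_add (which gcc and clang
compile to a LOCK-prefixed add/xadd on x86-64) are all assumptions.

/*
 * Illustrative sketch only -- not the exact program from the thread.
 * Each thread hammers a shared counter with an atomic add (emitted as
 * a LOCK-prefixed add/xadd on x86-64) and measures its own elapsed
 * cycles with RDTSC.
 *
 * Build (x86-64, gcc or clang):  cc -O2 -pthread lockxadd.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS (100ULL * 1000 * 1000)	/* per thread; scaled down from 10e9 */

static uint64_t counter;

static uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	/* RDTSC returns the timestamp counter in EDX:EAX */
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static void *
worker(void *arg)
{
	uint64_t *cycles = arg;
	uint64_t i, start;

	start = rdtsc();
	for (i = 0; i < ITERS; i++)
		__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
	*cycles = rdtsc() - start;
	return NULL;
}

int
main(void)
{
	pthread_t tid[NTHREADS];
	uint64_t cycles[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, &cycles[i]);
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		printf("thread %d: %llu cycles\n", i,
		    (unsigned long long)cycles[i]);
	}
	return 0;
}

Unrolling the loop body (repeating the atomic add several times per
iteration) gives the "fill the pipeline with LOCK-prefixed
instructions" variant described above; comparing the per-thread cycle
counts across threads gives the spread the standard deviation figures
refer to.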