* [PATCH] replace a mfence instruction by an xchg instruction @ 2015-08-15 6:51 Jens Gustedt 2015-08-15 20:17 ` Rich Felker 2015-08-16 16:28 ` Rich Felker 0 siblings, 2 replies; 15+ messages in thread From: Jens Gustedt @ 2015-08-15 6:51 UTC (permalink / raw) To: musl According to the wisdom of the Internet, e.g. https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ an mfence instruction is about 3 times slower than an xchg instruction. Here we not only had mfence but also the mov instruction that was to be protected by the fence. Replace all that by a native atomic instruction that gives all the ordering guarantees that we need. This a_store function is performance critical for the __lock primitive. In my benchmarks to test my stdatomic implementation I see a substantial performance increase (more than 10%), just because malloc does better with it. --- arch/x32/atomic.h | 4 ++-- arch/x86_64/atomic.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x32/atomic.h b/arch/x32/atomic.h
index 2ab1f7a..3a2f391 100644
--- a/arch/x32/atomic.h
+++ b/arch/x32/atomic.h
@@ -81,9 +81,9 @@ static inline void a_dec(volatile int *x)
 	__asm__( "lock ; decl %0" : "=m"(*x) : "m"(*x) : "memory" );
 }
 
-static inline void a_store(volatile int *p, int x)
+static inline void a_store(volatile int *x, int v)
 {
-	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
+	__asm__( "xchg %0, %1" : "=r"(v), "=m"(*x) : "0"(v) : "memory" );
 }
 
 static inline void a_spin()

diff --git a/arch/x86_64/atomic.h b/arch/x86_64/atomic.h
index 2ab1f7a..3a2f391 100644
--- a/arch/x86_64/atomic.h
+++ b/arch/x86_64/atomic.h
@@ -81,9 +81,9 @@ static inline void a_dec(volatile int *x)
 	__asm__( "lock ; decl %0" : "=m"(*x) : "m"(*x) : "memory" );
 }
 
-static inline void a_store(volatile int *p, int x)
+static inline void a_store(volatile int *x, int v)
 {
-	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
+	__asm__( "xchg %0, %1" : "=r"(v), "=m"(*x) : "0"(v) : "memory" );
 }
 
 static inline void a_spin()

-- 2.1.4 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt @ 2015-08-15 20:17 ` Rich Felker 2015-08-15 21:01 ` Jens Gustedt 2015-08-16 16:28 ` Rich Felker 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-15 20:17 UTC (permalink / raw) To: musl On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > according to the wisdom of the Internet, e.g > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > a mfence instruction is about 3 times slower than an xchg instruction. Uhg, then why does this instruction even exist if it does less and does it slower? > Here we not only had mfence but also the mov instruction that was to be > protected by the fence. Replace all that by a native atomic instruction > that gives all the ordering guarantees that we need. > > This a_store function is performance critical for the __lock > primitive. In my benchmarks to test my stdatomic implementation I have a > substantial performance increase (more than 10%), just because malloc > does better with it. Is there a reason you're not using the same approach as on i386? It was faster than xchg for me, and in principle it "should be faster". Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 20:17 ` Rich Felker @ 2015-08-15 21:01 ` Jens Gustedt 2015-08-15 23:28 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-15 21:01 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2290 bytes --] Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > according to the wisdom of the Internet, e.g > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > Uhg, then why does this instruction even exist if it does less and > does it slower? Because they do different things ?) mfence is to synchronize all memory; xchg, at least at first glance, only one word. But I also read that the relative performance of these instructions depends a lot on the actual die you are dealing with. > > Here we not only had mfence but also the mov instruction that was to be > > protected by the fence. Replace all that by a native atomic instruction > > that gives all the ordering guarantees that we need. > > > > This a_store function is performance critical for the __lock > > primitive. In my benchmarks to test my stdatomic implementation I have a > > substantial performance increase (more than 10%), just because malloc > > does better with it. > > Is there a reason you're not using the same approach as on i386? It > was faster than xchg for me, and in principle it "should be faster". I discovered your approach for i386 after I experimented with "xchg" for x86_64. I guess the "lock orl" instruction is a replacement for "mfence" because that one is not implemented for all variants of i386? Exactly why a "mov" followed by a read-modify-write operation on some unrelated address (here the stack pointer) should be faster than a read-modify-write operation on exactly the address you want to deal with is not obvious to me. I trust you that it is, but seen from outside this arch stuff resembles voodoo more than anything else. I'll experiment a bit with "mov" and your approach and see what I get. Thanks Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 21:01 ` Jens Gustedt @ 2015-08-15 23:28 ` Rich Felker 2015-08-16 12:42 ` Jens Gustedt 0 siblings, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-15 23:28 UTC (permalink / raw) To: musl On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > according to the wisdom of the Internet, e.g > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > Uhg, then why does this instruction even exist if it does less and > > does it slower? > > Because they do different things ?) > > mfence is to synchronize all memory, xchg, at least at a first glance, > only one word. No, any lock-prefixed instruction, or xchg which has a builtin lock, fully orders all memory accesses. Essentially it contains a builtin mfence. This is why I find it odd that a lone mfence is 3x slower. > But I also read that the relative performance of these instructions > depend a lot on the actual dice you are dealing with. Did you measure mfence being slower here? > > > Here we not only had mfence but also the mov instruction that was to be > > > protected by the fence. Replace all that by a native atomic instruction > > > that gives all the ordering guarantees that we need. > > > > > > This a_store function is performance critical for the __lock > > > primitive. In my benchmarks to test my stdatomic implementation I have a > > > substantial performance increase (more than 10%), just because malloc > > > does better with it. > > > > Is there a reason you're not using the same approach as on i386? It > > was faster than xchg for me, and in principle it "should be faster". 
> > I discovered your approach for i386 after I experimented with "xchg" > for x86_64. I guess the "lock orl" instruction is a replacement for > "mfence" because that one is not implemented for all variants of i386? > > Exactly why a "mov" followed by a read-modify-write operation to some > random address (here the stack pointer) should be faster than a > read-modify-write operation with exactly the address you want to deal > with looks weird. xchg on the atomic address in principle reads a cache line (the write destination) that's known to be shared with other threads despite not needing it, then modifies it. mov on the other hand just appends the write to the local store buffer; reading the target cache line is not necessary. The subsequent barrier then takes care of ordering issues (just the one case x86 doesn't guarantee already). At least that's what happens in theory. It seemed to be slightly faster in practice for me too, which is why I went with it for now (theoretically faster + weak empirical evidence that it's faster = reasonable basis for tentative conclusion, IMO) but I'm open to further study of the different approaches and changing if we find something else is better. Ultimately I think I'd like to get rid of a_store at some point anyway since I think we can do better without it. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 23:28 ` Rich Felker @ 2015-08-16 12:42 ` Jens Gustedt 2015-08-16 15:16 ` Rich Felker 2015-08-16 16:07 ` Rich Felker 0 siblings, 2 replies; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 12:42 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 6669 bytes --] Hello, Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > according to the wisdom of the Internet, e.g > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > Uhg, then why does this instruction even exist if it does less and > > > does it slower? > > > > Because they do different things ?) > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > only one word. > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > fully orders all memory accesses. Essentially it contains a builtin > mfence. Hm, I think mfence does a bit more than that. The three fence instructions were introduced when they invented the asynchronous ("non-temporal") move instructions that came with sse. I don't think that "lock" instructions synchronize with these asynchronous moves, so the two (lock instructions and fences) are just different types of animals. And this answers perhaps your question up-thread, why there is actually something like mfence. > This is why I find it odd that a lone mfence is 3x slower. > > But I also read that the relative performance of these instructions > > depend a lot on the actual dice you are dealing with. > > Did you measure mfence being slower here? 
Yes, I attach two graphs that show this (and other things). This is my benchmark for testing generic atomics, here by using __lock/__unlock as a primitive. If you just look at the performance difference for one thread, it is dramatic. Just by changing the a_store instruction the application throughput is 40% better. The difference comes not only from the fact that the performance of __lock/__unlock changes, but also because the whole rest of musl changes performance. In particular, I think, because the __lock/__unlock clones in malloc are affected by that change, too. (That benchmark also does a lot of malloc/free.) What I hadn't noticed before running these tests systematically last night is that the performance of that bench also changes in the congestion case, but there it goes the other way around. > > > > Here we not only had mfence but also the mov instruction that was to be > > > > protected by the fence. Replace all that by a native atomic instruction > > > > that gives all the ordering guarantees that we need. > > > > > > > > This a_store function is performance critical for the __lock > > > > primitive. In my benchmarks to test my stdatomic implementation I have a > > > > substantial performance increase (more than 10%), just because malloc > > > > does better with it. > > > > > > Is there a reason you're not using the same approach as on i386? It > > > was faster than xchg for me, and in principle it "should be faster".
> > xchg on the atomic address in principle reads a cache line (the write > destination) that's known to be shared with other threads despite not > needing it, then modifies it. mov on the other hand just appends the > write to the local store buffer; reading the target cache line is not > necessary. The subsequent barrier then takes care of ordering issues > (just the one case x86 doesn't guarantee already). > > At least that's what happens in theory. It seemed to be slightly > faster in practice for me too, which is why I went with it for now > (theoretically faster + weak empirical evidence that it's faster = > reasonable basis for tentative conclusion, IMO) but I'm open to > further study of the different approaches and changing if we find > something else is better. Yes, that theory probably isn't enough to describe things. If you look at the latency of a read-modify-write instruction you have something like L = I + 2 M where I is the instruction latency itself, and M is the memory latency. M varies with the platform, chipset, whatever, but it should be the same for all lock instructions on the same machine. Just to give an order of magnitude, M should be around 100. "I" in turn can be very different for different lock instructions, and in addition varies for different processor variants. If I read the tables correctly it can be from about 10 to 100 cycles, so something really noticeable compared to M. What you'd like to have for a store instruction is L_{store} = I_{store} + M avoiding the second M for the return from memory. (The second M can be hidden until we hit the next load instruction.) Your approach with orl has I_{mov} + I_{or} + M the original one has I_{mov} + I_{mfence} + 2 M which gives you an extra M the xchg approach has I_{xchg} + 2 M From the graphs we can deduce that in fact the orl strategy is more stable than the two others, but still the 20% gap for the "normal" situation worries me.
> Ultimately I think I'd like to get rid of a_store at some point anyway > since I think we can do better without it. At least in this context, yes. From what I see at the moment the best strategy for the initial part of a locking primitive is to only use cmpxchg instructions. This instruction sets "cc", so if it is integrated properly by the compiler, it can be used to conditionally jump, without even inspecting the value that had previously been in memory. Currently within musl such a close integration is difficult, because you can't easily jump from within __asm__. When using my stdatomic implementation (only building upon the builtin compiler support for atomics) the generated code is much better. Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #1.2: test-musl-all.eps --] [-- Type: image/x-eps, Size: 21344 bytes --] [-- Attachment #1.3: test-musl-relative.eps --] [-- Type: image/x-eps, Size: 21498 bytes --] [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 12:42 ` Jens Gustedt @ 2015-08-16 15:16 ` Rich Felker 2015-08-16 15:50 ` Jens Gustedt 2015-08-16 16:07 ` Rich Felker 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 15:16 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > Hello, > > Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > > according to the wisdom of the Internet, e.g > > > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > > > Uhg, then why does this instruction even exist if it does less and > > > > does it slower? > > > > > > Because they do different things ?) > > > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > > only one word. > > > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > > fully orders all memory accesses. Essentially it contains a builtin > > mfence. > > Hm, I think mfence does a bit more than that. The three fence > instructions were introduced when they invented the asynchronous > ("non-temporal") move instructions that came with sse. > > I don't think that "lock" instructions synchronize with these > asynchronous moves, so the two (lock instructions and fences) are just > different types of animals. And this answers perhaps your question > up-thread, why there is actually something like mfence. 
The relevant text seems to be the Intel manual, Vol 3A, 8.2.2 Memory Ordering in P6 and More Recent Processor Families: ---------------------------------------------------------------------- Reads are not reordered with other reads. Writes are not reordered with older reads. Writes to memory are not reordered with other writes, with the following exceptions: — writes executed with the CLFLUSH instruction; — streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and — string operations (see Section 8.2.4.1). Reads may be reordered with older writes to different locations but not with older writes to the same location. Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions. Reads cannot pass earlier LFENCE and MFENCE instructions. Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions. LFENCE instructions cannot pass earlier reads. SFENCE instructions cannot pass earlier writes. MFENCE instructions cannot pass earlier reads or writes ---------------------------------------------------------------------- See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf So mfence seems to be weaker than lock-prefixed instructions in terms of the ordering it imposes (lock-prefixed instructions forbid reordering and also have a total ordering across all cores). Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:16 ` Rich Felker @ 2015-08-16 15:50 ` Jens Gustedt 2015-08-16 15:58 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 15:50 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 4004 bytes --] Am Sonntag, den 16.08.2015, 11:16 -0400 schrieb Rich Felker: > On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > > Hello, > > > > Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > > > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > > > according to the wisdom of the Internet, e.g > > > > > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > > > > > Uhg, then why does this instruction even exist if it does less and > > > > > does it slower? > > > > > > > > Because they do different things ?) > > > > > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > > > only one word. > > > > > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > > > fully orders all memory accesses. Essentially it contains a builtin > > > mfence. > > > > Hm, I think mfence does a bit more than that. The three fence > > instructions were introduced when they invented the asynchronous > > ("non-temporal") move instructions that came with sse. > > > > I don't think that "lock" instructions synchronize with these > > asynchronous moves, so the two (lock instructions and fences) are just > > different types of animals. And this answers perhaps your question > > up-thread, why there is actually something like mfence. 
> The relevant text seems to be the Intel manual, Vol 3A, 8.2.2 Memory > Ordering in P6 and More Recent Processor Families: > > ---------------------------------------------------------------------- > Reads are not reordered with other reads. > > Writes are not reordered with older reads. > > Writes to memory are not reordered with other writes, with the > following exceptions: > — writes executed with the CLFLUSH instruction; > — streaming stores (writes) executed with the non-temporal move > instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and > — string operations (see Section 8.2.4.1). > > Reads may be reordered with older writes to different locations but > not with older writes to the same location. > > Reads or writes cannot be reordered with I/O instructions, locked > instructions, or serializing instructions. > > Reads cannot pass earlier LFENCE and MFENCE instructions. > > Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions. > > LFENCE instructions cannot pass earlier reads. > > SFENCE instructions cannot pass earlier writes. > > MFENCE instructions cannot pass earlier reads or writes > ---------------------------------------------------------------------- > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > So mfence seems to be weaker than lock-prefixed instructions in terms > of the ordering it imposes (lock-prefixed instructions forbid > reordering and also have a total ordering across all cores). Yes, it says on page 8-26 that the fences are definitively not serializing instructions. (But what I tried to show in my previous mail still holds: the instruction latency itself plays a big part in the efficiency of these instructions.) I read all of that as: - mfence can be used to achieve acq_rel ordering - none of the fences can be used to achieve seq_cst ordering Wasn't the idea that all atomic.h functions implement sequential consistency?
Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:50 ` Jens Gustedt @ 2015-08-16 15:58 ` Rich Felker 2015-08-16 16:16 ` Jens Gustedt 0 siblings, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 15:58 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 05:50:21PM +0200, Jens Gustedt wrote: > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > > > So mfence seems to be weaker than lock-prefixed instructions in terms > > of the ordering it imposes (lock-prefixed instructions forbid > > reordering and also have a total ordering across all cores). > > Yes, it says so on page 8-26 that the fences are definitively not > serializing instructions. > > (But what I tried to show in my previous mail still holds, the > instruction latency itself plays a big part in the efficiency of these > instructions.) I wasn't trying to contradict anything you've said, just expressing the absurdity of mfence being slower than lock-prefixed instructions, since it's a strictly-weaker operation. > I read all of that as: > > - mfence can be used to achieve acq_rel ordering > - none of the fences can be use to achieve seq_cst ordering By this you mean that only lock-prefixed instructions impose a total order across all cores? > Wasn't the idea that all atomic.h functions implement sequential > consistency? Yes, that's the intent, but I don't want to introduce 'major' performance regressions fixing 'minor' failures to be seq_cst if there's no observable misbehavior in the code using them. Still it would be nice to know whether such failures still exist, and if so where, so we can eventually clean this up. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:58 ` Rich Felker @ 2015-08-16 16:16 ` Jens Gustedt 2015-08-16 18:30 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 16:16 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2559 bytes --] Am Sonntag, den 16.08.2015, 11:58 -0400 schrieb Rich Felker: > On Sun, Aug 16, 2015 at 05:50:21PM +0200, Jens Gustedt wrote: > > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > > > > > So mfence seems to be weaker than lock-prefixed instructions in terms > > > of the ordering it imposes (lock-prefixed instructions forbid > > > reordering and also have a total ordering across all cores). > > > > Yes, it says so on page 8-26 that the fences are definitively not > > serializing instructions. > > > > (But what I tried to show in my previous mail still holds, the > > instruction latency itself plays a big part in the efficiency of these > > instructions.) > > I wasn't trying to contradict anything you've said, just expressing > the absurdity of mfence being slower than lock-prefixed instructions, > since it's a strictly-weaker operation. Yes, I got that :) One argument that we have neglected so far is the impact on other threads/cores. Even if such an mfence instruction may be more expensive for the thread that issues it, it imposes fewer constraints on other threads. Maybe overall this could be a win? > > I read all of that as: > > > > - mfence can be used to achieve acq_rel ordering > > - none of the fences can be use to achieve seq_cst ordering > > By this you mean that only lock-prefixed instructions impose a total > order across all cores? Plus these very expensive complete serializing instructions that are listed in the manual. > > Wasn't the idea that all atomic.h functions implement sequential > > consistency?
> > Yes, that's the intent, but I don't want to introduce 'major' > performance regressions fixing 'minor' failures to be seq_cst if > there's no observable misbehavior in the code using them. Misbehavior here is really hard to track down. An application that changes behavior only when seq_cst is not guaranteed would be quite difficult to observe. > Still it > would be nice to know whether such failures still exist, and if so > where, so we can eventually clean this up. Replacing "mfence" by "lock ; orl $0,(%%rsp)" would provide us with safety without compromising performance :) Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 16:16 ` Jens Gustedt @ 2015-08-16 18:30 ` Rich Felker 0 siblings, 0 replies; 15+ messages in thread From: Rich Felker @ 2015-08-16 18:30 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 06:16:41PM +0200, Jens Gustedt wrote: > > Still it > > would be nice to know whether such failures still exist, and if so > > where, so we can eventually clean this up. > > Replacing "mfence" by "lock ; orl $0,(%%rsp)" would provide us with > security by not compromising performance :) OK, I've done that for now. If we determine something else is better in the future we can change it but for now I just want to reduce the magnitude of the performance regression that will go in the release, and this does it. :) Thanks. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 12:42 ` Jens Gustedt 2015-08-16 15:16 ` Rich Felker @ 2015-08-16 16:07 ` Rich Felker 2015-08-16 16:34 ` Jens Gustedt 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 16:07 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > > > Exactly why a "mov" followed by a read-modify-write operation to some > > > random address (here the stack pointer) should be faster than a > > > read-modify-write operation with exactly the address you want to deal > > > with looks weird. > > > > xchg on the atomic address in principle reads a cache line (the write > > destination) that's known to be shared with other threads despite not > > needing it, then modifies is. mov on the other hand just appends the > > write to the local store buffer; reading the target cache line is not > > necessary. The subsequent barrier then takes care of ordering issues > > (just the one case x86 doesn't guarantee already). > > > > At least that's what happens in theory. It seemed to be slightly > > faster in practice for me too, which is why I went with it for now > > (theoretically faster + weak empirical evidence that it's faster = > > reasonable basis for tentative conclusion, IMO) but I'm open to > > further study of the different approaches and changing if we find > > something else is better. > > Yes, that theory probably isn't enough to describe things. If you look > at the latency of a read-modify-write instructions you have something > like > > L = I + 2 M > > where I is instruction latency itself, and M is the memory latency. M > varies with the platform, chipset whatever, but it should be the same > for all lock instructions on the same machine. Just to give an order > of magnitude, M should be around 100. > > "I" in turn can be very different for different lock instructions, and > in addition varies for different processor variants. 
If I read the > tables correctly it can be from about 10 to 100 cycles, so something > really noticeable compared to M. I'm confused where you're getting this from. The instruction latency for most basic ALU and mov type operations on x86 is in the single-digit range (and I mean binary digits, in many cases ;-). > What you'd like to have for a store instruction is > > L_{store} = I_{store} + M > > avoiding the second M for the return from memory. (The second M can be > hidden until we hit the next load instruction) > > Your approach with orl has > > I_{mov} + I_{or} + M > > the original one has > > I_{mov} + I_{mfence} + 2 M > > which gives you an extra M > > the xchg approach has > > I_{xchg} + 2 M > > From the graphs we can deduce that in fact the orl strategy is more > stable that the two others, but still the 20% gap for the "normal" > situation worries me. Where are you measuring "20% gap for the 'normal' situation"? I'm missing some context here. > > Ultimately I think I'd like to get rid of a_store at some point anyway > > since I think we can do better without it. > > At least in this context, yes. From what I see at the moment the best > strategy for the initial part of a locking primitive is to only use > cmpxchg instructions. This instruction sets "cc", so if is integrated > properly by the compiler, it can be used to conditionally jump, > without even inspecting the value that had previously been in memory. > > Currently within musl such a close integration is difficult, because > you can't easily jump from within __asm__. When using my stdatomic > implementation (only building upon the builtin compiler support for > atomics) the generated code is much better. While I'd like to be able to use cc/flags directly, I don't think this is a worthwhile optimization, at least on x86. The difference in performance is totally dominated by the cost of the atomic op, and the old-value output of cmpxchg is a register anyway so there's no extra memory-read operation. 
Actually I experimented with GCC asm-goto
labels for sh4a ll/sc atomics, because getting the result out of the
cc bit into a register and then testing it is mildly expensive there,
and gcc generated amazingly good code, but the problems are:

(1) it depends on a nasty gcc feature that other compilers don't
implement well or at all, and

(2) it imposes a branch instruction inside the inline asm even if the
caller wants to throw away the result.

In any case, this does not seem to be a worthwhile area to optimize.
The real costs of atomics are the memory latency and cache
synchronization overhead, not any time spent in instructions working
purely with registers.

Rich

^ permalink raw reply	[flat|nested] 15+ messages in thread
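For reference, the three a_store strategies compared in this thread read roughly as follows in GNU C inline asm for x86_64. The mfence and xchg versions are taken from the patch at the head of the thread; the orl variant is only a sketch of the "lock orl on the stack" barrier trick under discussion, not necessarily the exact code Rich benchmarked:

```c
/* Original musl a_store: plain store followed by a full fence. */
static inline void a_store_mfence(volatile int *p, int x)
{
	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
}

/* Patched version: xchg with a memory operand is implicitly locked
   and acts as a full barrier, so no separate fence is needed. */
static inline void a_store_xchg(volatile int *p, int x)
{
	__asm__( "xchg %1, %0" : "=m"(*p), "+r"(x) : : "memory" );
}

/* Sketch of the orl trick: plain store, then a locked no-op
   read-modify-write on the stack slot at %rsp as a cheap
   StoreLoad barrier. */
static inline void a_store_orl(volatile int *p, int x)
{
	__asm__( "mov %1, %0 ; lock ; orl $0,(%%rsp)"
	         : "=m"(*p) : "r"(x) : "memory" );
}
```

All three publish the same value; they differ only in how the StoreLoad ordering is enforced, which is exactly what the L = I + 2M cost model in this thread tries to capture.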
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:07             ` Rich Felker
@ 2015-08-16 16:34               ` Jens Gustedt
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Gustedt @ 2015-08-16 16:34 UTC (permalink / raw)
To: musl

On Sunday, 16 Aug 2015 at 12:07 -0400, Rich Felker wrote:
> On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > > Exactly why a "mov" followed by a read-modify-write operation to some
> > > > random address (here the stack pointer) should be faster than a
> > > > read-modify-write operation with exactly the address you want to deal
> > > > with looks weird.
> > >
> > > xchg on the atomic address in principle reads a cache line (the write
> > > destination) that's known to be shared with other threads despite not
> > > needing it, then modifies it. mov on the other hand just appends the
> > > write to the local store buffer; reading the target cache line is not
> > > necessary. The subsequent barrier then takes care of ordering issues
> > > (just the one case x86 doesn't guarantee already).
> > >
> > > At least that's what happens in theory. It seemed to be slightly
> > > faster in practice for me too, which is why I went with it for now
> > > (theoretically faster + weak empirical evidence that it's faster =
> > > reasonable basis for tentative conclusion, IMO) but I'm open to
> > > further study of the different approaches and changing if we find
> > > something else is better.
> >
> > Yes, that theory probably isn't enough to describe things. If you look
> > at the latency of a read-modify-write instruction you have something
> > like
> >
> >   L = I + 2 M
> >
> > where I is the instruction latency itself, and M is the memory latency.
> > M varies with the platform, chipset, whatever, but it should be the same
> > for all lock instructions on the same machine. Just to give an order
> > of magnitude, M should be around 100.
> > "I" in turn can be very different for different lock instructions,
> > and in addition varies for different processor variants. If I read the
> > tables correctly it can be from about 10 to 100 cycles, so something
> > really noticeable compared to M.
>
> I'm confused where you're getting this from. The instruction latency
> for most basic ALU and mov type operations on x86 is in the
> single-digit range (and I mean binary digits, in many cases ;-).

I was talking about the latencies of locked instructions; they are
significantly higher. So if you want to discuss the overall latency of
an instruction that guarantees memory order, you have to take these
latencies into account.

> > What you'd like to have for a store instruction is
> >
> >   L_{store} = I_{store} + M
> >
> > avoiding the second M for the return from memory. (The second M can be
> > hidden until we hit the next load instruction.)
> >
> > Your approach with orl has
> >
> >   I_{mov} + I_{or} + M
> >
> > the original one has
> >
> >   I_{mov} + I_{mfence} + 2 M
> >
> > which gives you an extra M
> >
> > the xchg approach has
> >
> >   I_{xchg} + 2 M
> >
> > From the graphs we can deduce that in fact the orl strategy is more
> > stable than the two others, but still the 20% gap for the "normal"
> > situation worries me.
>
> Where are you measuring "20% gap for the 'normal' situation"? I'm
> missing some context here.

In the graphics that I had attached. For 1 thread, the performance
improvement over the "mfence" version with the "orl" trick is 20%,
but the "xchg" version gains even 20% more. I didn't measure this,
but before the introduction of the "mfence" into "a_store" the
performance should have been at least as good as for the "xchg"
version.

> > > Ultimately I think I'd like to get rid of a_store at some point anyway
> > > since I think we can do better without it.
> >
> > At least in this context, yes.
> > From what I see at the moment the best
> > strategy for the initial part of a locking primitive is to only use
> > cmpxchg instructions. This instruction sets "cc", so if it is integrated
> > properly by the compiler, it can be used to conditionally jump,
> > without even inspecting the value that had previously been in memory.
> >
> > Currently within musl such a close integration is difficult, because
> > you can't easily jump from within __asm__. When using my stdatomic
> > implementation (only building upon the builtin compiler support for
> > atomics) the generated code is much better.
>
> While I'd like to be able to use cc/flags directly, I don't think this
> is a worthwhile optimization, at least on x86. The difference in
> performance is totally dominated by the cost of the atomic op, and the
> old-value output of cmpxchg is a register anyway so there's no extra
> memory-read operation. Actually I experimented with GCC asm-goto
> labels for sh4a ll/sc atomics, because getting the result out of the
> cc bit into a register and then testing it is mildly expensive there,
> and gcc generated amazingly good code, but the problems are:
>
> (1) it depends on a nasty gcc feature that other compilers don't
> implement well or at all, and
>
> (2) it imposes a branch instruction inside the inline asm even if the
> caller wants to throw away the result.

Yes, I definitely agree. That is what you can avoid by using the
compiler's own constructs, the __sync, __atomic or __c11_atomic
builtins. Here they know about the features of the cmpxchg instruction
and can do appropriate jumps.

> In any case, this does not seem to be a worthwhile area to optimize.
> The real costs of atomics are the memory latency and cache
> synchronization overhead, not any time spent in instructions working
> purely with registers.

I don't agree. First, this was not completely what I meant.
I think being able to use the CC to jump can also abort some fetches
of memory that are still in the pipeline, so this is perhaps more than
just some "register" instructions.

Then, at the beginning of all lock constructs we have tight spin
loops. As far as I can tell, in case of contention every instruction
in them counts.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
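A minimal sketch of the cmpxchg-only lock entry Jens describes, written with the GCC/clang __atomic builtins rather than musl's actual primitives (the function names here are illustrative): when the comparison result is visible to the compiler, it can branch directly on the ZF that lock cmpxchg sets, instead of first extracting the old value into a register and testing it.

```c
/* Acquire: try to swing the lock word from 0 to 1; on failure spin
   with plain relaxed loads so the cache line can stay shared until
   the lock looks free again.  A real lock would eventually
   futex-wait instead of spinning forever. */
static void spin_lock(volatile int *l)
{
	for (;;) {
		int expected = 0;
		if (__atomic_compare_exchange_n(l, &expected, 1, 0,
		                                __ATOMIC_ACQUIRE,
		                                __ATOMIC_RELAXED))
			return;
		while (__atomic_load_n(l, __ATOMIC_RELAXED))
			;
	}
}

/* Release: a plain release store; no read-modify-write needed. */
static void spin_unlock(volatile int *l)
{
	__atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```

With -O2 on x86_64, the compare-exchange loop compiles to a lock cmpxchg followed by a conditional jump on the flags, which is exactly the integration being discussed.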
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-15  6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt
  2015-08-15 20:17 ` Rich Felker
@ 2015-08-16 16:28 ` Rich Felker
  2015-08-16 16:38   ` Jens Gustedt
  1 sibling, 1 reply; 15+ messages in thread
From: Rich Felker @ 2015-08-16 16:28 UTC (permalink / raw)
To: musl

On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> according to the wisdom of the Internet, e.g.
>
> https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
>
> a mfence instruction is about 3 times slower than an xchg instruction.

I can't find where the article makes this claim. Could you point out
what part you're referring to?

Rich
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:28 ` Rich Felker
@ 2015-08-16 16:38   ` Jens Gustedt
  2015-08-16 17:00     ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2015-08-16 16:38 UTC (permalink / raw)
To: musl

On Sunday, 16 Aug 2015 at 12:28 -0400, Rich Felker wrote:
> On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > according to the wisdom of the Internet, e.g.
> >
> > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> >
> > a mfence instruction is about 3 times slower than an xchg instruction.
>
> I can't find where the article makes this claim. Could you point out
> what part you're referring to?

I read this in the section titled "Performance comparison".

There it says something like "lock xchg" at 16x baseline and "mfence"
at 47-67x baseline. But perhaps I am misinterpreting things.

Jens
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:38 ` Jens Gustedt
@ 2015-08-16 17:00   ` Rich Felker
  0 siblings, 0 replies; 15+ messages in thread
From: Rich Felker @ 2015-08-16 17:00 UTC (permalink / raw)
To: musl

On Sun, Aug 16, 2015 at 06:38:32PM +0200, Jens Gustedt wrote:
> On Sunday, 16 Aug 2015 at 12:28 -0400, Rich Felker wrote:
> > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > according to the wisdom of the Internet, e.g.
> > >
> > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> > >
> > > a mfence instruction is about 3 times slower than an xchg instruction.
> >
> > I can't find where the article makes this claim. Could you point out
> > what part you're referring to?
>
> I read this in the section titled "Performance comparison".
>
> There it says something like "lock xchg" at 16x baseline and "mfence"
> at 47-67x baseline. But perhaps I am misinterpreting things.

This is comparing an idiotic (and invalid, but maybe barely working on
x86) Dekker lock using mfence with a simple xchg-based lock. It's not
comparing mfence to lock-xchg as barriers.

Rich
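To make Rich's distinction concrete: an xchg-based lock uses the atomic swap itself for acquisition, which is a different operation from benchmarking mfence as a standalone barrier. A minimal sketch of such a lock, using the __atomic builtins (hypothetical names, not code from the article):

```c
/* Acquire: atomically swap in 1 until the old value was 0.  On x86,
   __atomic_exchange_n on an int compiles to xchg, which is
   implicitly locked and a full barrier on its own. */
static void xchg_lock(volatile int *l)
{
	while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE))
		;
}

/* Release: a plain release store is sufficient. */
static void xchg_unlock(volatile int *l)
{
	__atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```

The article's "47-67x" figure attaches to a whole Dekker-style locking protocol built around mfence, not to the mfence instruction itself, so it says little about mfence versus xchg as a_store strategies.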
end of thread, other threads: [~2015-08-16 18:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-15  6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt
2015-08-15 20:17 ` Rich Felker
2015-08-15 21:01   ` Jens Gustedt
2015-08-15 23:28     ` Rich Felker
2015-08-16 12:42       ` Jens Gustedt
2015-08-16 15:16         ` Rich Felker
2015-08-16 15:50           ` Jens Gustedt
2015-08-16 15:58             ` Rich Felker
2015-08-16 16:16               ` Jens Gustedt
2015-08-16 18:30                 ` Rich Felker
2015-08-16 16:07             ` Rich Felker
2015-08-16 16:34               ` Jens Gustedt
2015-08-16 16:28 ` Rich Felker
2015-08-16 16:38   ` Jens Gustedt
2015-08-16 17:00     ` Rich Felker
Code repositories for project(s) associated with this public inbox:
https://git.vuxu.org/mirror/musl/