* [PATCH] replace a mfence instruction by an xchg instruction @ 2015-08-15 6:51 Jens Gustedt 2015-08-15 20:17 ` Rich Felker 2015-08-16 16:28 ` Rich Felker 0 siblings, 2 replies; 15+ messages in thread From: Jens Gustedt @ 2015-08-15 6:51 UTC (permalink / raw) To: musl According to the wisdom of the Internet, e.g. https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ an mfence instruction is about 3 times slower than an xchg instruction. Here we not only had mfence but also the mov instruction that was to be protected by the fence. Replace all that by a native atomic instruction that gives all the ordering guarantees that we need. This a_store function is performance critical for the __lock primitive. In my benchmarks to test my stdatomic implementation I see a substantial performance increase (more than 10%), just because malloc does better with it. --- arch/x32/atomic.h | 4 ++-- arch/x86_64/atomic.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x32/atomic.h b/arch/x32/atomic.h
index 2ab1f7a..3a2f391 100644
--- a/arch/x32/atomic.h
+++ b/arch/x32/atomic.h
@@ -81,9 +81,9 @@ static inline void a_dec(volatile int *x)
 	__asm__( "lock ; decl %0" : "=m"(*x) : "m"(*x) : "memory" );
 }
 
-static inline void a_store(volatile int *p, int x)
+static inline void a_store(volatile int *x, int v)
 {
-	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
+	__asm__( "xchg %0, %1" : "=r"(v), "=m"(*x) : "0"(v) : "memory" );
 }
 
 static inline void a_spin()

diff --git a/arch/x86_64/atomic.h b/arch/x86_64/atomic.h
index 2ab1f7a..3a2f391 100644
--- a/arch/x86_64/atomic.h
+++ b/arch/x86_64/atomic.h
@@ -81,9 +81,9 @@ static inline void a_dec(volatile int *x)
 	__asm__( "lock ; decl %0" : "=m"(*x) : "m"(*x) : "memory" );
 }
 
-static inline void a_store(volatile int *p, int x)
+static inline void a_store(volatile int *x, int v)
 {
-	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
+	__asm__( "xchg %0, %1" : "=r"(v), "=m"(*x) : "0"(v) : "memory" );
 }
 
 static inline void a_spin()

-- 2.1.4 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt @ 2015-08-15 20:17 ` Rich Felker 2015-08-15 21:01 ` Jens Gustedt 2015-08-16 16:28 ` Rich Felker 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-15 20:17 UTC (permalink / raw) To: musl On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > according to the wisdom of the Internet, e.g > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > a mfence instruction is about 3 times slower than an xchg instruction. Uhg, then why does this instruction even exist if it does less and does it slower? > Here we not only had mfence but also the mov instruction that was to be > protected by the fence. Replace all that by a native atomic instruction > that gives all the ordering guarantees that we need. > > This a_store function is performance critical for the __lock > primitive. In my benchmarks to test my stdatomic implementation I have a > substantial performance increase (more than 10%), just because malloc > does better with it. Is there a reason you're not using the same approach as on i386? It was faster than xchg for me, and in principle it "should be faster". Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 20:17 ` Rich Felker @ 2015-08-15 21:01 ` Jens Gustedt 2015-08-15 23:28 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-15 21:01 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2290 bytes --] Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > according to the wisdom of the Internet, e.g > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > Uhg, then why does this instruction even exist if it does less and > does it slower? Because they do different things ?) mfence is to synchronize all memory; xchg, at least at first glance, only one word. But I also read that the relative performance of these instructions depends a lot on the actual die you are dealing with. > > Here we not only had mfence but also the mov instruction that was to be > > protected by the fence. Replace all that by a native atomic instruction > > that gives all the ordering guarantees that we need. > > > > This a_store function is performance critical for the __lock > > primitive. In my benchmarks to test my stdatomic implementation I have a > > substantial performance increase (more than 10%), just because malloc > > does better with it. > > Is there a reason you're not using the same approach as on i386? It > was faster than xchg for me, and in principle it "should be faster". I discovered your approach for i386 after I experimented with "xchg" for x86_64. I guess the "lock orl" instruction is a replacement for "mfence" because that one is not implemented for all variants of i386? Exactly why a "mov" followed by a read-modify-write operation on some unrelated address (here the stack pointer) should be faster than a read-modify-write operation on exactly the address you want to deal with is not obvious to me. I trust you that it is, but seen from outside this arch stuff resembles voodoo more than anything else. I'll experiment a bit with "mov" and your approach and see what I get. Thanks Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 21:01 ` Jens Gustedt @ 2015-08-15 23:28 ` Rich Felker 2015-08-16 12:42 ` Jens Gustedt 0 siblings, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-15 23:28 UTC (permalink / raw) To: musl On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > according to the wisdom of the Internet, e.g > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > Uhg, then why does this instruction even exist if it does less and > > does it slower? > > Because they do different things ?) > > mfence is to synchronize all memory, xchg, at least at a first glance, > only one word. No, any lock-prefixed instruction, or xchg which has a builtin lock, fully orders all memory accesses. Essentially it contains a builtin mfence. This is why I find it odd that a lone mfence is 3x slower. > But I also read that the relative performance of these instructions > depend a lot on the actual dice you are dealing with. Did you measure mfence being slower here? > > > Here we not only had mfence but also the mov instruction that was to be > > > protected by the fence. Replace all that by a native atomic instruction > > > that gives all the ordering guarantees that we need. > > > > > > This a_store function is performance critical for the __lock > > > primitive. In my benchmarks to test my stdatomic implementation I have a > > > substantial performance increase (more than 10%), just because malloc > > > does better with it. > > > > Is there a reason you're not using the same approach as on i386? It > > was faster than xchg for me, and in principle it "should be faster". 
> > I discovered your approach for i386 after I experimented with "xchg" > for x86_64. I guess the "lock orl" instruction is a replacement for > "mfence" because that one is not implemented for all variants of i386? > > Exactly why a "mov" followed by a read-modify-write operation to some > random address (here the stack pointer) should be faster than a > read-modify-write operation with exactly the address you want to deal > with looks weird. xchg on the atomic address in principle reads a cache line (the write destination) that's known to be shared with other threads despite not needing it, then modifies it. mov on the other hand just appends the write to the local store buffer; reading the target cache line is not necessary. The subsequent barrier then takes care of ordering issues (just the one case x86 doesn't guarantee already). At least that's what happens in theory. It seemed to be slightly faster in practice for me too, which is why I went with it for now (theoretically faster + weak empirical evidence that it's faster = reasonable basis for tentative conclusion, IMO) but I'm open to further study of the different approaches and changing if we find something else is better. Ultimately I think I'd like to get rid of a_store at some point anyway since I think we can do better without it. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-15 23:28 ` Rich Felker @ 2015-08-16 12:42 ` Jens Gustedt 2015-08-16 15:16 ` Rich Felker 2015-08-16 16:07 ` Rich Felker 0 siblings, 2 replies; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 12:42 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 6669 bytes --] Hello, Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > according to the wisdom of the Internet, e.g > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > Uhg, then why does this instruction even exist if it does less and > > > does it slower? > > > > Because they do different things ?) > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > only one word. > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > fully orders all memory accesses. Essentially it contains a builtin > mfence. Hm, I think mfence does a bit more than that. The three fence instructions were introduced when they invented the asynchronous ("non-temporal") move instructions that came with sse. I don't think that "lock" instructions synchronize with these asynchronous moves, so the two (lock instructions and fences) are just different types of animals. And this answers perhaps your question up-thread, why there is actually something like mfence. > This is why I find it odd that a lone mfence is 3x slower. > > But I also read that the relative performance of these instructions > > depend a lot on the actual dice you are dealing with. > > Did you measure mfence being slower here? 
Yes, I attach two graphs that show this (and other things). This is my benchmark for testing generic atomics, here by using __lock/__unlock as a primitive. If you just look at the performance difference for one thread, it is dramatic. Just by changing the a_store instruction the application throughput is 40% better. The difference comes not only from the fact that the performance of __lock/__unlock changes, but also because the whole rest of musl changes performance. In particular, I think, because the __lock/__unlock clones in malloc are affected by that change, too. (That benchmark also does a lot of malloc/free.) What I hadn't noticed before running these tests systematically last night is that the performance of that bench also changes in the congestion case, but there it goes the other way around. > > > > Here we not only had mfence but also the mov instruction that was to be > > > > protected by the fence. Replace all that by a native atomic instruction > > > > that gives all the ordering guarantees that we need. > > > > > > > > This a_store function is performance critical for the __lock > > > > primitive. In my benchmarks to test my stdatomic implementation I have a > > > > substantial performance increase (more than 10%), just because malloc > > > > does better with it. > > > > > > Is there a reason you're not using the same approach as on i386? It > > > was faster than xchg for me, and in principle it "should be faster".
> > xchg on the atomic address in principle reads a cache line (the write > destination) that's known to be shared with other threads despite not > needing it, then modifies it. mov on the other hand just appends the > write to the local store buffer; reading the target cache line is not > necessary. The subsequent barrier then takes care of ordering issues > (just the one case x86 doesn't guarantee already). > > At least that's what happens in theory. It seemed to be slightly > faster in practice for me too, which is why I went with it for now > (theoretically faster + weak empirical evidence that it's faster = > reasonable basis for tentative conclusion, IMO) but I'm open to > further study of the different approaches and changing if we find > something else is better. Yes, that theory probably isn't enough to describe things. If you look at the latency of a read-modify-write instruction you have something like L = I + 2 M where I is the instruction latency itself, and M is the memory latency. M varies with the platform, chipset, whatever, but it should be the same for all lock instructions on the same machine. Just to give an order of magnitude, M should be around 100. "I" in turn can be very different for different lock instructions, and in addition varies for different processor variants. If I read the tables correctly it can be from about 10 to 100 cycles, so something really noticeable compared to M. What you'd like to have for a store instruction is L_{store} = I_{store} + M avoiding the second M for the return from memory. (The second M can be hidden until we hit the next load instruction.) Your approach with orl has I_{mov} + I_{or} + M the original one has I_{mov} + I_{mfence} + 2 M which gives you an extra M the xchg approach has I_{xchg} + 2 M From the graphs we can deduce that in fact the orl strategy is more stable than the two others, but still the 20% gap for the "normal" situation worries me.
> Ultimately I think I'd like to get rid of a_store at some point anyway > since I think we can do better without it. At least in this context, yes. From what I see at the moment the best strategy for the initial part of a locking primitive is to only use cmpxchg instructions. This instruction sets "cc", so if it is integrated properly by the compiler, it can be used to conditionally jump, without even inspecting the value that had previously been in memory. Currently within musl such a close integration is difficult, because you can't easily jump from within __asm__. When using my stdatomic implementation (only building upon the builtin compiler support for atomics) the generated code is much better. Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #1.2: test-musl-all.eps --] [-- Type: image/x-eps, Size: 21344 bytes --] [-- Attachment #1.3: test-musl-relative.eps --] [-- Type: image/x-eps, Size: 21498 bytes --] [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 12:42 ` Jens Gustedt @ 2015-08-16 15:16 ` Rich Felker 2015-08-16 15:50 ` Jens Gustedt 2015-08-16 16:07 ` Rich Felker 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 15:16 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > Hello, > > Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > > according to the wisdom of the Internet, e.g > > > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > > > Uhg, then why does this instruction even exist if it does less and > > > > does it slower? > > > > > > Because they do different things ?) > > > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > > only one word. > > > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > > fully orders all memory accesses. Essentially it contains a builtin > > mfence. > > Hm, I think mfence does a bit more than that. The three fence > instructions were introduced when they invented the asynchronous > ("non-temporal") move instructions that came with sse. > > I don't think that "lock" instructions synchronize with these > asynchronous moves, so the two (lock instructions and fences) are just > different types of animals. And this answers perhaps your question > up-thread, why there is actually something like mfence. 
The relevant text seems to be the Intel manual, Vol 3A, 8.2.2 Memory Ordering in P6 and More Recent Processor Families: ---------------------------------------------------------------------- Reads are not reordered with other reads. Writes are not reordered with older reads. Writes to memory are not reordered with other writes, with the following exceptions: — writes executed with the CLFLUSH instruction; — streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and — string operations (see Section 8.2.4.1). Reads may be reordered with older writes to different locations but not with older writes to the same location. Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions. Reads cannot pass earlier LFENCE and MFENCE instructions. Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions. LFENCE instructions cannot pass earlier reads. SFENCE instructions cannot pass earlier writes. MFENCE instructions cannot pass earlier reads or writes ---------------------------------------------------------------------- See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf So mfence seems to be weaker than lock-prefixed instructions in terms of the ordering it imposes (lock-prefixed instructions forbid reordering and also have a total ordering across all cores). Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:16 ` Rich Felker @ 2015-08-16 15:50 ` Jens Gustedt 2015-08-16 15:58 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 15:50 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 4004 bytes --] Am Sonntag, den 16.08.2015, 11:16 -0400 schrieb Rich Felker: > On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > > Hello, > > > > Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker: > > > On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote: > > > > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker: > > > > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote: > > > > > > according to the wisdom of the Internet, e.g > > > > > > > > > > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/ > > > > > > > > > > > > a mfence instruction is about 3 times slower than an xchg instruction. > > > > > > > > > > Uhg, then why does this instruction even exist if it does less and > > > > > does it slower? > > > > > > > > Because they do different things ?) > > > > > > > > mfence is to synchronize all memory, xchg, at least at a first glance, > > > > only one word. > > > > > > No, any lock-prefixed instruction, or xchg which has a builtin lock, > > > fully orders all memory accesses. Essentially it contains a builtin > > > mfence. > > > > Hm, I think mfence does a bit more than that. The three fence > > instructions were introduced when they invented the asynchronous > > ("non-temporal") move instructions that came with sse. > > > > I don't think that "lock" instructions synchronize with these > > asynchronous moves, so the two (lock instructions and fences) are just > > different types of animals. And this answers perhaps your question > > up-thread, why there is actually something like mfence. 
> The relevant text seems to be the Intel manual, Vol 3A, 8.2.2 Memory > Ordering in P6 and More Recent Processor Families: > > ---------------------------------------------------------------------- > Reads are not reordered with other reads. > > Writes are not reordered with older reads. > > Writes to memory are not reordered with other writes, with the > following exceptions: > — writes executed with the CLFLUSH instruction; > — streaming stores (writes) executed with the non-temporal move > instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and > — string operations (see Section 8.2.4.1). > > Reads may be reordered with older writes to different locations but > not with older writes to the same location. > > Reads or writes cannot be reordered with I/O instructions, locked > instructions, or serializing instructions. > > Reads cannot pass earlier LFENCE and MFENCE instructions. > > Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions. > > LFENCE instructions cannot pass earlier reads. > > SFENCE instructions cannot pass earlier writes. > > MFENCE instructions cannot pass earlier reads or writes > ---------------------------------------------------------------------- > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > So mfence seems to be weaker than lock-prefixed instructions in terms > of the ordering it imposes (lock-prefixed instructions forbid > reordering and also have a total ordering across all cores). Yes, it says on page 8-26 that the fences are definitively not serializing instructions. (But what I tried to show in my previous mail still holds: the instruction latency itself plays a big part in the efficiency of these instructions.) I read all of that as: - mfence can be used to achieve acq_rel ordering - none of the fences can be used to achieve seq_cst ordering Wasn't the idea that all atomic.h functions implement sequential consistency?
Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:50 ` Jens Gustedt @ 2015-08-16 15:58 ` Rich Felker 2015-08-16 16:16 ` Jens Gustedt 0 siblings, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 15:58 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 05:50:21PM +0200, Jens Gustedt wrote: > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > > > So mfence seems to be weaker than lock-prefixed instructions in terms > > of the ordering it imposes (lock-prefixed instructions forbid > > reordering and also have a total ordering across all cores). > > Yes, it says so on page 8-26 that the fences are definitively not > serializing instructions. > > (But what I tried to show in my previous mail still holds, the > instruction latency itself plays a big part in the efficiency of these > instructions.) I wasn't trying to contradict anything you've said, just expressing the absurdity of mfence being slower than lock-prefixed instructions, since it's a strictly-weaker operation. > I read all of that as: > > - mfence can be used to achieve acq_rel ordering > - none of the fences can be use to achieve seq_cst ordering By this you mean that only lock-prefixed instructions impose a total order across all cores? > Wasn't the idea that all atomic.h functions implement sequential > consistency? Yes, that's the intent, but I don't want to introduce 'major' performance regressions fixing 'minor' failures to be seq_cst if there's no observable misbehavior in the code using them. Still it would be nice to know whether such failures still exist, and if so where, so we can eventually clean this up. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 15:58 ` Rich Felker @ 2015-08-16 16:16 ` Jens Gustedt 2015-08-16 18:30 ` Rich Felker 0 siblings, 1 reply; 15+ messages in thread From: Jens Gustedt @ 2015-08-16 16:16 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2559 bytes --] Am Sonntag, den 16.08.2015, 11:58 -0400 schrieb Rich Felker: > On Sun, Aug 16, 2015 at 05:50:21PM +0200, Jens Gustedt wrote: > > > See page 330, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf > > > > > > So mfence seems to be weaker than lock-prefixed instructions in terms > > > of the ordering it imposes (lock-prefixed instructions forbid > > > reordering and also have a total ordering across all cores). > > > > Yes, it says so on page 8-26 that the fences are definitively not > > serializing instructions. > > > > (But what I tried to show in my previous mail still holds, the > > instruction latency itself plays a big part in the efficiency of these > > instructions.) > > I wasn't trying to contradict anything you've said, just expressing > the absurdity of mfence being slower than lock-prefixed instructions, > since it's a strictly-weaker operation. Yes, I got that :) One argument that we have neglected so far is the impact on other threads/cores. Even if such an mfence instruction may be more expensive for the thread that issues it, it imposes fewer constraints on other threads. Maybe overall this could be a win? > > I read all of that as: > > > > - mfence can be used to achieve acq_rel ordering > > - none of the fences can be use to achieve seq_cst ordering > > By this you mean that only lock-prefixed instructions impose a total > order across all cores? Plus these very expensive complete serializing instructions that are listed in the manual. > > Wasn't the idea that all atomic.h functions implement sequential > > consistency?
> > Yes, that's the intent, but I don't want to introduce 'major' > performance regressions fixing 'minor' failures to be seq_cst if > there's no observable misbehavior in the code using them. Misbehavior here is really hard to track down. An application that changes behavior only when seq_cst is not guaranteed would be quite difficult to observe. > Still it > would be nice to know whether such failures still exist, and if so > where, so we can eventually clean this up. Replacing "mfence" by "lock ; orl $0,(%%rsp)" would provide us with safety without compromising performance :) Jens -- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 16:16 ` Jens Gustedt @ 2015-08-16 18:30 ` Rich Felker 0 siblings, 0 replies; 15+ messages in thread From: Rich Felker @ 2015-08-16 18:30 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 06:16:41PM +0200, Jens Gustedt wrote: > > Still it > > would be nice to know whether such failures still exist, and if so > > where, so we can eventually clean this up. > > Replacing "mfence" by "lock ; orl $0,(%%rsp)" would provide us with > security by not compromising performance :) OK, I've done that for now. If we determine something else is better in the future we can change it but for now I just want to reduce the magnitude of the performance regression that will go in the release, and this does it. :) Thanks. Rich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] replace a mfence instruction by an xchg instruction 2015-08-16 12:42 ` Jens Gustedt 2015-08-16 15:16 ` Rich Felker @ 2015-08-16 16:07 ` Rich Felker 2015-08-16 16:34 ` Jens Gustedt 1 sibling, 1 reply; 15+ messages in thread From: Rich Felker @ 2015-08-16 16:07 UTC (permalink / raw) To: musl On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote: > > > Exactly why a "mov" followed by a read-modify-write operation to some > > > random address (here the stack pointer) should be faster than a > > > read-modify-write operation with exactly the address you want to deal > > > with looks weird. > > > > xchg on the atomic address in principle reads a cache line (the write > > destination) that's known to be shared with other threads despite not > > needing it, then modifies is. mov on the other hand just appends the > > write to the local store buffer; reading the target cache line is not > > necessary. The subsequent barrier then takes care of ordering issues > > (just the one case x86 doesn't guarantee already). > > > > At least that's what happens in theory. It seemed to be slightly > > faster in practice for me too, which is why I went with it for now > > (theoretically faster + weak empirical evidence that it's faster = > > reasonable basis for tentative conclusion, IMO) but I'm open to > > further study of the different approaches and changing if we find > > something else is better. > > Yes, that theory probably isn't enough to describe things. If you look > at the latency of a read-modify-write instructions you have something > like > > L = I + 2 M > > where I is instruction latency itself, and M is the memory latency. M > varies with the platform, chipset whatever, but it should be the same > for all lock instructions on the same machine. Just to give an order > of magnitude, M should be around 100. > > "I" in turn can be very different for different lock instructions, and > in addition varies for different processor variants. 
If I read the > tables correctly it can be from about 10 to 100 cycles, so something > really noticeable compared to M. I'm confused where you're getting this from. The instruction latency for most basic ALU and mov type operations on x86 is in the single-digit range (and I mean binary digits, in many cases ;-). > What you'd like to have for a store instruction is > > L_{store} = I_{store} + M > > avoiding the second M for the return from memory. (The second M can be > hidden until we hit the next load instruction) > > Your approach with orl has > > I_{mov} + I_{or} + M > > the original one has > > I_{mov} + I_{mfence} + 2 M > > which gives you an extra M > > the xchg approach has > > I_{xchg} + 2 M > > From the graphs we can deduce that in fact the orl strategy is more > stable that the two others, but still the 20% gap for the "normal" > situation worries me. Where are you measuring "20% gap for the 'normal' situation"? I'm missing some context here. > > Ultimately I think I'd like to get rid of a_store at some point anyway > > since I think we can do better without it. > > At least in this context, yes. From what I see at the moment the best > strategy for the initial part of a locking primitive is to only use > cmpxchg instructions. This instruction sets "cc", so if is integrated > properly by the compiler, it can be used to conditionally jump, > without even inspecting the value that had previously been in memory. > > Currently within musl such a close integration is difficult, because > you can't easily jump from within __asm__. When using my stdatomic > implementation (only building upon the builtin compiler support for > atomics) the generated code is much better. While I'd like to be able to use cc/flags directly, I don't think this is a worthwhile optimization, at least on x86. The difference in performance is totally dominated by the cost of the atomic op, and the old-value output of cmpxchg is a register anyway so there's no extra memory-read operation. 
Actually I experimented with GCC asm-goto
labels for sh4a ll/sc atomics, because getting the result out of the
cc bit into a register and then testing it is mildly expensive there,
and gcc generated amazingly good code, but the problems are:

(1) it depends on a nasty gcc feature that other compilers don't
implement well or at all, and

(2) it imposes a branch instruction inside the inline asm even if the
caller wants to throw away the result.

In any case, this does not seem to be a worthwhile area to optimize.
The real costs of atomics are the memory latency and cache
synchronization overhead, not any time spent in instructions working
purely with registers.

Rich

^ permalink raw reply	[flat|nested] 15+ messages in thread
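For reference, the three a_store strategies compared in this thread read roughly as follows in GNU C inline asm for x86_64. The mfence and xchg versions are taken from the patch at the head of the thread; the orl variant is only a sketch of the "lock orl on the stack" barrier trick under discussion, not necessarily the exact code Rich benchmarked:

```c
/* Original musl a_store: plain store followed by a full fence. */
static inline void a_store_mfence(volatile int *p, int x)
{
	__asm__( "mov %1, %0 ; mfence" : "=m"(*p) : "r"(x) : "memory" );
}

/* Patched version: xchg with a memory operand is implicitly locked
   and acts as a full barrier, so no separate fence is needed. */
static inline void a_store_xchg(volatile int *p, int x)
{
	__asm__( "xchg %1, %0" : "=m"(*p), "+r"(x) : : "memory" );
}

/* Sketch of the orl trick: plain store, then a locked no-op
   read-modify-write on the stack slot at %rsp as a cheap
   StoreLoad barrier. */
static inline void a_store_orl(volatile int *p, int x)
{
	__asm__( "mov %1, %0 ; lock ; orl $0,(%%rsp)"
	         : "=m"(*p) : "r"(x) : "memory" );
}
```

All three publish the same value; they differ only in how the StoreLoad ordering is enforced, which is exactly what the L = I + 2M cost model in this thread tries to capture.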
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:07             ` Rich Felker
@ 2015-08-16 16:34               ` Jens Gustedt
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Gustedt @ 2015-08-16 16:34 UTC (permalink / raw)
To: musl

On Sunday, 16 Aug 2015 at 12:07 -0400, Rich Felker wrote:
> On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > > Exactly why a "mov" followed by a read-modify-write operation to some
> > > > random address (here the stack pointer) should be faster than a
> > > > read-modify-write operation with exactly the address you want to deal
> > > > with looks weird.
> > >
> > > xchg on the atomic address in principle reads a cache line (the write
> > > destination) that's known to be shared with other threads despite not
> > > needing it, then modifies it. mov on the other hand just appends the
> > > write to the local store buffer; reading the target cache line is not
> > > necessary. The subsequent barrier then takes care of ordering issues
> > > (just the one case x86 doesn't guarantee already).
> > >
> > > At least that's what happens in theory. It seemed to be slightly
> > > faster in practice for me too, which is why I went with it for now
> > > (theoretically faster + weak empirical evidence that it's faster =
> > > reasonable basis for tentative conclusion, IMO) but I'm open to
> > > further study of the different approaches and changing if we find
> > > something else is better.
> >
> > Yes, that theory probably isn't enough to describe things. If you look
> > at the latency of a read-modify-write instruction you have something
> > like
> >
> >   L = I + 2 M
> >
> > where I is the instruction latency itself, and M is the memory latency.
> > M varies with the platform, chipset, whatever, but it should be the same
> > for all lock instructions on the same machine. Just to give an order
> > of magnitude, M should be around 100.
> > "I" in turn can be very different for different lock instructions,
> > and in addition varies for different processor variants. If I read the
> > tables correctly it can be from about 10 to 100 cycles, so something
> > really noticeable compared to M.
>
> I'm confused where you're getting this from. The instruction latency
> for most basic ALU and mov type operations on x86 is in the
> single-digit range (and I mean binary digits, in many cases ;-).

I was talking about the latencies of locked instructions; they are
significantly higher. So if you want to discuss the overall latency of
an instruction that guarantees memory order, you have to take these
latencies into account.

> > What you'd like to have for a store instruction is
> >
> >   L_{store} = I_{store} + M
> >
> > avoiding the second M for the return from memory. (The second M can be
> > hidden until we hit the next load instruction.)
> >
> > Your approach with orl has
> >
> >   I_{mov} + I_{or} + M
> >
> > the original one has
> >
> >   I_{mov} + I_{mfence} + 2 M
> >
> > which gives you an extra M
> >
> > the xchg approach has
> >
> >   I_{xchg} + 2 M
> >
> > From the graphs we can deduce that in fact the orl strategy is more
> > stable than the two others, but still the 20% gap for the "normal"
> > situation worries me.
>
> Where are you measuring "20% gap for the 'normal' situation"? I'm
> missing some context here.

In the graphics that I had attached. For 1 thread, the performance
improvement over the "mfence" version with the "orl" trick is 20%,
but the "xchg" version gains even 20% more. I didn't measure this,
but before the introduction of the "mfence" into "a_store" the
performance should have been at least as good as for the "xchg"
version.

> > > Ultimately I think I'd like to get rid of a_store at some point anyway
> > > since I think we can do better without it.
> >
> > At least in this context, yes.
> > From what I see at the moment the best
> > strategy for the initial part of a locking primitive is to only use
> > cmpxchg instructions. This instruction sets "cc", so if it is integrated
> > properly by the compiler, it can be used to conditionally jump,
> > without even inspecting the value that had previously been in memory.
> >
> > Currently within musl such a close integration is difficult, because
> > you can't easily jump from within __asm__. When using my stdatomic
> > implementation (only building upon the builtin compiler support for
> > atomics) the generated code is much better.
>
> While I'd like to be able to use cc/flags directly, I don't think this
> is a worthwhile optimization, at least on x86. The difference in
> performance is totally dominated by the cost of the atomic op, and the
> old-value output of cmpxchg is a register anyway so there's no extra
> memory-read operation. Actually I experimented with GCC asm-goto
> labels for sh4a ll/sc atomics, because getting the result out of the
> cc bit into a register and then testing it is mildly expensive there,
> and gcc generated amazingly good code, but the problems are:
>
> (1) it depends on a nasty gcc feature that other compilers don't
> implement well or at all, and
>
> (2) it imposes a branch instruction inside the inline asm even if the
> caller wants to throw away the result.

Yes, I definitely agree. That is what you can avoid by using the
compiler's own constructs, the __sync, __atomic or __c11_atomic
builtins. Here they know about the features of the cmpxchg instruction
and can do appropriate jumps.

> In any case, this does not seem to be a worthwhile area to optimize.
> The real costs of atomics are the memory latency and cache
> synchronization overhead, not any time spent in instructions working
> purely with registers.

I don't agree. First, this was not completely what I meant.
I think being able to use the CC to jump can also abort some fetches
of memory that are still in the pipeline, so this is perhaps more than
just some "register" instructions.

Then, at the beginning of all lock constructs we have tight spin
loops. As far as I can tell, in case of contention every instruction
in them counts.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
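A minimal sketch of the cmpxchg-only lock entry Jens describes, written with the GCC/clang __atomic builtins rather than musl's actual primitives (the function names here are illustrative): when the comparison result is visible to the compiler, it can branch directly on the ZF that lock cmpxchg sets, instead of first extracting the old value into a register and testing it.

```c
/* Acquire: try to swing the lock word from 0 to 1; on failure spin
   with plain relaxed loads so the cache line can stay shared until
   the lock looks free again.  A real lock would eventually
   futex-wait instead of spinning forever. */
static void spin_lock(volatile int *l)
{
	for (;;) {
		int expected = 0;
		if (__atomic_compare_exchange_n(l, &expected, 1, 0,
		                                __ATOMIC_ACQUIRE,
		                                __ATOMIC_RELAXED))
			return;
		while (__atomic_load_n(l, __ATOMIC_RELAXED))
			;
	}
}

/* Release: a plain release store; no read-modify-write needed. */
static void spin_unlock(volatile int *l)
{
	__atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```

With -O2 on x86_64, the compare-exchange loop compiles to a lock cmpxchg followed by a conditional jump on the flags, which is exactly the integration being discussed.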
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-15  6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt
  2015-08-15 20:17 ` Rich Felker
@ 2015-08-16 16:28 ` Rich Felker
  2015-08-16 16:38   ` Jens Gustedt
  1 sibling, 1 reply; 15+ messages in thread
From: Rich Felker @ 2015-08-16 16:28 UTC (permalink / raw)
To: musl

On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> according to the wisdom of the Internet, e.g.
>
> https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
>
> a mfence instruction is about 3 times slower than an xchg instruction.

I can't find where the article makes this claim. Could you point out
what part you're referring to?

Rich
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:28 ` Rich Felker
@ 2015-08-16 16:38   ` Jens Gustedt
  2015-08-16 17:00     ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2015-08-16 16:38 UTC (permalink / raw)
To: musl

On Sunday, 16 Aug 2015 at 12:28 -0400, Rich Felker wrote:
> On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > according to the wisdom of the Internet, e.g.
> >
> > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> >
> > a mfence instruction is about 3 times slower than an xchg instruction.
>
> I can't find where the article makes this claim. Could you point out
> what part you're referring to?

I read this in the section titled "Performance comparison".

There it says something like "lock xchg" at 16x baseline and "mfence"
at 47-67x baseline. But perhaps I am misinterpreting things.

Jens
* Re: [PATCH] replace a mfence instruction by an xchg instruction
  2015-08-16 16:38 ` Jens Gustedt
@ 2015-08-16 17:00   ` Rich Felker
  0 siblings, 0 replies; 15+ messages in thread
From: Rich Felker @ 2015-08-16 17:00 UTC (permalink / raw)
To: musl

On Sun, Aug 16, 2015 at 06:38:32PM +0200, Jens Gustedt wrote:
> On Sunday, 16 Aug 2015 at 12:28 -0400, Rich Felker wrote:
> > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > according to the wisdom of the Internet, e.g.
> > >
> > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> > >
> > > a mfence instruction is about 3 times slower than an xchg instruction.
> >
> > I can't find where the article makes this claim. Could you point out
> > what part you're referring to?
>
> I read this in the section titled "Performance comparison".
>
> There it says something like "lock xchg" at 16x baseline and "mfence"
> at 47-67x baseline. But perhaps I am misinterpreting things.

This is comparing an idiotic (and invalid, but maybe barely working on
x86) Dekker lock using mfence with a simple xchg-based lock. It's not
comparing mfence to lock-xchg as barriers.

Rich
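To make Rich's distinction concrete: an xchg-based lock uses the atomic swap itself for acquisition, which is a different operation from benchmarking mfence as a standalone barrier. A minimal sketch of such a lock, using the __atomic builtins (hypothetical names, not code from the article):

```c
/* Acquire: atomically swap in 1 until the old value was 0.  On x86,
   __atomic_exchange_n on an int compiles to xchg, which is
   implicitly locked and a full barrier on its own. */
static void xchg_lock(volatile int *l)
{
	while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE))
		;
}

/* Release: a plain release store is sufficient. */
static void xchg_unlock(volatile int *l)
{
	__atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```

The article's "47-67x" figure attaches to a whole Dekker-style locking protocol built around mfence, not to the mfence instruction itself, so it says little about mfence versus xchg as a_store strategies.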
end of thread, other threads: [~2015-08-16 18:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-15  6:51 [PATCH] replace a mfence instruction by an xchg instruction Jens Gustedt
2015-08-15 20:17 ` Rich Felker
2015-08-15 21:01   ` Jens Gustedt
2015-08-15 23:28     ` Rich Felker
2015-08-16 12:42       ` Jens Gustedt
2015-08-16 15:16         ` Rich Felker
2015-08-16 15:50           ` Jens Gustedt
2015-08-16 15:58             ` Rich Felker
2015-08-16 16:16               ` Jens Gustedt
2015-08-16 18:30                 ` Rich Felker
2015-08-16 16:07             ` Rich Felker
2015-08-16 16:34               ` Jens Gustedt
2015-08-16 16:28 ` Rich Felker
2015-08-16 16:38   ` Jens Gustedt
2015-08-16 17:00     ` Rich Felker
Code repositories for project(s) associated with this public inbox:
https://git.vuxu.org/mirror/musl/