mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: [PATCH] replace a mfence instruction by an xchg instruction
Date: Sun, 16 Aug 2015 12:07:40 -0400
Message-ID: <20150816160740.GQ31018@brightrain.aerifal.cx>
In-Reply-To: <1439728953.9803.30.camel@inria.fr>

On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > Exactly why a "mov" followed by a read-modify-write operation to some
> > > random address (here the stack pointer) should be faster than a
> > > read-modify-write operation with exactly the address you want to deal
> > > with looks weird.
> > 
> > xchg on the atomic address in principle reads a cache line (the write
> > destination) that's known to be shared with other threads despite not
> > needing it, then modifies it. mov on the other hand just appends the
> > write to the local store buffer; reading the target cache line is not
> > necessary. The subsequent barrier then takes care of ordering issues
> > (just the one case x86 doesn't guarantee already).
> > 
> > At least that's what happens in theory. It seemed to be slightly
> > faster in practice for me too, which is why I went with it for now
> > (theoretically faster + weak empirical evidence that it's faster =
> > reasonable basis for tentative conclusion, IMO) but I'm open to
> > further study of the different approaches and changing if we find
> > something else is better.
> 
> Yes, that theory probably isn't enough to describe things. If you look
> at the latency of a read-modify-write instruction you have something
> like
> 
> L = I + 2 M
> 
> where I is the instruction latency itself, and M is the memory
> latency. M varies with the platform, chipset, whatever, but it should
> be the same for all lock instructions on the same machine. Just to
> give an order of magnitude, M should be around 100 cycles.
> 
> "I" in turn can be very different for different lock instructions, and
> in addition varies for different processor variants. If I read the
> tables correctly it can be from about 10 to 100 cycles, so something
> really noticeable compared to M.

I'm confused where you're getting this from. The instruction latency
for most basic ALU and mov type operations on x86 is in the
single-digit range (and I mean binary digits, in many cases ;-).
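
For concreteness, the three a_store strategies being compared look
roughly like this as i386 inline asm. This is a sketch only; the
function names and constraint details are illustrative, not
necessarily musl's exact code:

/* original: plain store followed by a full fence */
static inline void a_store_mfence(volatile int *p, int x)
{
	__asm__ __volatile__( "movl %1,%0 ; mfence"
		: "=m"(*p) : "r"(x) : "memory" );
}

/* store, then a locked RMW on a stack location; the lock prefix
 * provides the ordering without re-reading the target cache line */
static inline void a_store_lock_or(volatile int *p, int x)
{
	__asm__ __volatile__( "movl %1,%0 ; lock ; orl $0,(%%esp)"
		: "=m"(*p) : "r"(x) : "memory" );
}

/* single locked RMW directly on the target address; xchg with a
 * memory operand is implicitly locked */
static inline void a_store_xchg(volatile int *p, int x)
{
	__asm__ __volatile__( "xchgl %1,%0"
		: "=m"(*p), "+r"(x) : : "memory" );
}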

> What you'd like to have for a store instruction is
> 
> L_{store} = I_{store} + M
> 
> avoiding the second M for the return from memory. (The second M can be
> hidden until we hit the next load instruction.)
> 
> Your approach with orl has
> 
> I_{mov} + I_{or} + M
> 
> the original one has
> 
> I_{mov} + I_{mfence} + 2 M
> 
> which gives you an extra M
> 
> the xchg approach has
> 
> I_{xchg} + 2 M
> 
> From the graphs we can deduce that in fact the orl strategy is more
> stable than the other two, but still the 20% gap for the "normal"
> situation worries me.

Where are you measuring "20% gap for the 'normal' situation"? I'm
missing some context here.

> > Ultimately I think I'd like to get rid of a_store at some point anyway
> > since I think we can do better without it.
> 
> At least in this context, yes. From what I see at the moment the best
> strategy for the initial part of a locking primitive is to only use
> cmpxchg instructions. This instruction sets "cc", so if it is
> integrated properly by the compiler, it can be used to conditionally
> jump, without even inspecting the value that had previously been in
> memory.
> 
> Currently within musl such a close integration is difficult, because
> you can't easily jump from within __asm__. When using my stdatomic
> implementation (only building upon the builtin compiler support for
> atomics) the generated code is much better.

While I'd like to be able to use cc/flags directly, I don't think this
is a worthwhile optimization, at least on x86. The difference in
performance is totally dominated by the cost of the atomic op, and the
old-value output of cmpxchg is a register anyway so there's no extra
memory-read operation. Actually I experimented with GCC asm-goto
labels for sh4a ll/sc atomics, because getting the result out of the
cc bit into a register then testing it is mildly expensive there, and
gcc generated amazingly good code, but the problems are:

(1) it depends on a nasty gcc feature that other compilers don't
implement well or at all, and

(2) it imposes a branch instruction inside the inline asm even if the
caller wants to throw away the result.

In any case, this does not seem to be a worthwhile area to optimize.
The real costs of atomics are the memory latency and cache
synchronization overhead, not any time spent in instructions working
purely with registers.
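
For illustration, here's a minimal sketch of the asm-goto pattern in
question (hypothetical code, written as an x86 test-and-set trylock
rather than the actual sh4a ll/sc asm). Note that gcc's asm goto
permits no output operands, which is part of what makes it nasty:

/* branch directly on cc instead of extracting the flag into a
 * register and testing it; returns nonzero on success. the memory
 * write is covered by the "memory" clobber, since no outputs are
 * allowed with asm goto */
static inline int trylock_asmgoto(volatile int *l)
{
	__asm__ goto( "lock ; btsl $0,%0 ; jc %l[busy]"
		: /* no outputs allowed */
		: "m"(*l)
		: "memory", "cc"
		: busy );
	return 1;	/* CF was 0: bit was clear, lock acquired */
busy:
	return 0;	/* CF was 1: lock already held */
}

The jc inside the asm is the branch that problem (2) refers to: it is
emitted even when the caller throws away the result.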

Rich


Thread overview: 15+ messages
2015-08-15  6:51 Jens Gustedt
2015-08-15 20:17 ` Rich Felker
2015-08-15 21:01   ` Jens Gustedt
2015-08-15 23:28     ` Rich Felker
2015-08-16 12:42       ` Jens Gustedt
2015-08-16 15:16         ` Rich Felker
2015-08-16 15:50           ` Jens Gustedt
2015-08-16 15:58             ` Rich Felker
2015-08-16 16:16               ` Jens Gustedt
2015-08-16 18:30                 ` Rich Felker
2015-08-16 16:07         ` Rich Felker [this message]
2015-08-16 16:34           ` Jens Gustedt
2015-08-16 16:28 ` Rich Felker
2015-08-16 16:38   ` Jens Gustedt
2015-08-16 17:00     ` Rich Felker
