mailing list of musl libc
* Refactoring atomics as llsc?
@ 2015-05-20  5:11 Rich Felker
  2015-05-20  5:33 ` Timo Teras
  2015-05-21  4:12 ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Rich Felker @ 2015-05-20  5:11 UTC (permalink / raw)
  To: musl

In the inline sh4a atomics thread, I discussed an idea of refactoring
atomics for llsc-style archs so that the arch would just need to
provide inline asm ll() and sc() functions (with more namespace-clean
names of course), and a shared atomic.h could build the a_* functions
on top of that, as in:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	do old = ll(p);
	while (old == t && !sc(p, s));
	return old;
}

(Note: I've omitted barriers for simplicity; they could be in the ll
and sc functions but it would probably make more sense to have them
outside the loop.)

In the sh4a thread, I was somewhat discouraged with this approach
because there's no way to model the output of the sc asm coming out in
a condition flag; the inline asm would have to convert the flag to a
value in a register. However, looking at the archs we have now, only
or1k, powerpc, and sh put the sc output in a condition flag. The rest
of the llsc-type archs leave the result as a boolean value in a
general-purpose register. So only a few archs would be negatively
affected, and only in a very minor way.

On the other hand, the benefits of doing this would be pretty
significant:

First of all, basically all archs could share all the implementation
logic for atomics, massively reducing the amount of per-arch asm and
the risk of subtle errors. Right now the only way to keep per-arch asm
down is to implement just a_cas and have everything else be a wrapper
for a_cas, but that leads to all the other atomics being a pair of
nested do/while loops, which is larger and slower than having a
"native" version of each op.

Second, this approach would allow us to add some really nice new
atomic primitives without having to write them per-arch: things like
atomic-dec-if-positive. Normally these would be written with a CAS
retry loop, which is a pair of nested do/while loops on llsc archs,
but with the above approach it would be a single loop.
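
For example, a hypothetical a_dec_if_positive (name invented for this
sketch; barriers omitted as above, and the old value is returned)
would be a single loop:

static inline int a_dec_if_positive(volatile int *p)
{
	int old;
	do {
		old = ll(p);
		if (old <= 0) break;	/* don't store; just report old */
	} while (!sc(p, old-1));
	return old;
}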

Of course the big outlier is x86, which is not llsc based but has
actual atomic primitives at the instruction level. If we defined the
sc() primitive to take 3 args instead of 2 (address, old value from
ll, new value to conditionally store; most archs would ignore the old
value argument) then we could model x86 with ll being a plain load and
sc being cmpxchg to allow any new custom primitives to work using
cmpxchg. Then we would just continue providing custom versions of all
the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
a_swap) to take advantage of the x86 instructions. These versions
could probably be shared by all x86 variants (i386, x86_64, x32) since
they're operating on 32-bit values and the asm should be the same.
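
As a rough sketch (illustrative only; it mirrors the style of the
existing i386 cmpxchg asm, but the exact constraints would need
review), the x86 fake ll/sc might look like:

static inline int ll(volatile int *p)
{
	return *p;
}

static inline int sc(volatile int *p, int old, int v)
{
	int r;
	__asm__ __volatile__ ( "lock ; cmpxchg %3, %1"
		: "=a"(r), "=m"(*p) : "a"(old), "r"(v) : "memory" );
	return r == old;
}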

If we decide to go this way, it would replace the atomic.h refactoring
work I already sent to the list (Deduplicating atomics written in
terms of CAS).

For the few archs that would be adversely affected (albeit only in a
minor way) by this approach due to the inability to model condition
flags in asm outputs, we could in principle still keep some or all of
the old asm (probably cutting it down to the primitives that actually
matter for performance) if desired.

We could also keep the concept of atomic_generic.h using __sync
builtins, but offer ll() and sc() using a plain load for ll and
__sync_bool_compare_and_swap for sc, and then define a few other a_*
functions that the __sync primitives allow us to define directly.
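
A sketch of what that could look like (again using the 3-argument sc
form, with barriers omitted):

static inline int ll(volatile int *p)
{
	return *p;
}

static inline int sc(volatile int *p, int old, int v)
{
	return __sync_bool_compare_and_swap(p, old, v);
}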

Comments?

Rich



* Re: Refactoring atomics as llsc?
  2015-05-20  5:11 Refactoring atomics as llsc? Rich Felker
@ 2015-05-20  5:33 ` Timo Teras
  2015-05-20  6:19   ` Rich Felker
  2015-05-20  6:36   ` Rich Felker
  2015-05-21  4:12 ` Rich Felker
  1 sibling, 2 replies; 9+ messages in thread
From: Timo Teras @ 2015-05-20  5:33 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Wed, 20 May 2015 01:11:08 -0400
Rich Felker <dalias@libc.org> wrote:

> Of course the big outlier is x86, which is not llsc based but has
> actual atomic primitives at the instruction level. If we defined the
> sc() primitive to take 3 args instead of 2 (address, old value from
> ll, new value to conditionally store; most archs would ignore the old
> value argument) then we could model x86 with ll being a plain load and
> sc being cmpxchg to allow any new custom primitives to work using
> cmpxchg. Then we would just continue providing custom versions of all
> the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
> a_swap) to take advantage of the x86 instructions. These versions
> could probably be shared by all x86 variants (i386, x86_64, x32) since
> they're operating on 32-bit values and the asm should be the same.

I wonder if calling that kind of emulation ll()/sc() would be
misleading. Load-linked/store-conditional has stronger guarantees: sc
will fail if the cache line was invalidated in between, the thread was
preempted, etc.

cmpxchg can be used to emulate it only when the user is aware of the
ABA problem (some other thread may have changed the value behind us
multiple times). Such emulation is of course ok for a_fetch_add, etc.,
but one needs to be more careful when using pointers (and trying to
make sure the same pointer was not first removed and later re-added).
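
To illustrate (a hypothetical interleaving, not real code; a true
ll/sc pair would fail T1's sc here, because its reservation is lost):

	T1: old = ll(p);	/* reads value A */
	T2: *p = B; *p = A;	/* value changes, then changes back */
	T1: sc(p, old, v);	/* cmpxchg compares A == A: "succeeds" */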

And if you want to optimize the above-mentioned cases, one really
needs to know whether it's true ll+sc, or else write the
synchronization differently. In such cases the algorithm is often
implemented twice, once for each set of available atomics.

/Timo



* Re: Refactoring atomics as llsc?
  2015-05-20  5:33 ` Timo Teras
@ 2015-05-20  6:19   ` Rich Felker
  2015-05-20  6:36   ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2015-05-20  6:19 UTC (permalink / raw)
  To: musl

On Wed, May 20, 2015 at 08:33:23AM +0300, Timo Teras wrote:
> On Wed, 20 May 2015 01:11:08 -0400
> Rich Felker <dalias@libc.org> wrote:
> 
> > Of course the big outlier is x86, which is not llsc based but has
> > actual atomic primitives at the instruction level. If we defined the
> > sc() primitive to take 3 args instead of 2 (address, old value from
> > ll, new value to conditionally store; most archs would ignore the old
> > value argument) then we could model x86 with ll being a plain load and
> > sc being cmpxchg to allow any new custom primitives to work using
> > cmpxchg. Then we would just continue providing custom versions of all
> > the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
> > a_swap) to take advantage of the x86 instructions. These versions
> > could probably be shared by all x86 variants (i386, x86_64, x32) since
> > they're operating on 32-bit values and the asm should be the same.
> 
> I wonder if calling that kind of emulation ll()/sc() would be
> misleading. Load-linked/store-conditional has stronger guarantees: sc
> will fail if the cache line was invalidated in between, the thread was
> preempted, etc.
> 
> cmpxchg can be used to emulate it only when the user is aware of the
> ABA problem (some other thread may have changed the value behind us
> multiple times). Such emulation is of course ok for a_fetch_add, etc.,
> but one needs to be more careful when using pointers (and trying to
> make sure the same pointer was not first removed and later re-added).
> 
> And if you want to optimize the above-mentioned cases, one really
> needs to know whether it's true ll+sc, or else write the
> synchronization differently. In such cases the algorithm is often
> implemented twice, once for each set of available atomics.

Yes. The intent is not to expose ll/sc to musl source files, merely to
use them as a basis for generic implementations of the existing atomic
primitives and perhaps some new ones that might be interesting for
improvements to semaphores or for other special tasks. You're right
that we should be careful, if doing a 'fake' ll/sc, to document that
it is only usable for direct value operations not subject to ABA
issues. As long as we don't go adding new atomic ops on pointers
(other than the a_cas_p we have now) I don't see any such issues being
likely to arise, though.

Rich



* Re: Refactoring atomics as llsc?
  2015-05-20  5:33 ` Timo Teras
  2015-05-20  6:19   ` Rich Felker
@ 2015-05-20  6:36   ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2015-05-20  6:36 UTC (permalink / raw)
  To: musl

On Wed, May 20, 2015 at 08:33:23AM +0300, Timo Teras wrote:
> On Wed, 20 May 2015 01:11:08 -0400
> Rich Felker <dalias@libc.org> wrote:
> 
> > Of course the big outlier is x86, which is not llsc based but has
> > actual atomic primitives at the instruction level. If we defined the
> > sc() primitive to take 3 args instead of 2 (address, old value from
> > ll, new value to conditionally store; most archs would ignore the old
> > value argument) then we could model x86 with ll being a plain load and
> > sc being cmpxchg to allow any new custom primitives to work using
> > cmpxchg. Then we would just continue providing custom versions of all
> > the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
> > a_swap) to take advantage of the x86 instructions. These versions
> > could probably be shared by all x86 variants (i386, x86_64, x32) since
> > they're operating on 32-bit values and the asm should be the same.
> 
> I wonder if calling that kind of emulation ll()/sc() would be
> misleading. Load-linked/store-conditional has stronger guarantees: sc
> will fail if the cache line was invalidated in between, the thread was
> preempted, etc.
> 
> cmpxchg can be used to emulate it only when the user is aware of the
> ABA problem (some other thread may have changed the value behind us
> multiple times). Such emulation is of course ok for a_fetch_add, etc.,
> but one needs to be more careful when using pointers (and trying to
> make sure the same pointer was not first removed and later re-added).
> 
> And if you want to optimize the above-mentioned cases, one really
> needs to know whether it's true ll+sc, or else write the
> synchronization differently. In such cases the algorithm is often
> implemented twice, once for each set of available atomics.

And yes, an alternative would be not to provide fake ll/sc for archs
without it, but instead to keep the existing generic CAS-based
implementations for use when ll/sc is not available. Then we'd have
two generic implementations of everything instead of just one, but it
would probably be cleaner.

Rich



* Re: Refactoring atomics as llsc?
  2015-05-20  5:11 Refactoring atomics as llsc? Rich Felker
  2015-05-20  5:33 ` Timo Teras
@ 2015-05-21  4:12 ` Rich Felker
  2015-05-21 12:06   ` Szabolcs Nagy
  1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2015-05-21  4:12 UTC (permalink / raw)
  To: musl

On Wed, May 20, 2015 at 01:11:08AM -0400, Rich Felker wrote:
> In the inline sh4a atomics thread, I discussed an idea of refactoring
> atomics for llsc-style archs so that the arch would just need to
> provide inline asm ll() and sc() functions (with more namespace-clean
> names of course), and a shared atomic.h could build the a_* functions
> on top of that, as in:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	do old = ll(p);
> 	while (old == t && !sc(p, s));
> 	return old;
> }
> 
> (Note: I've omitted barriers for simplicity; they could be in the ll
> and sc functions but it would probably make more sense to have them
> outside the loop.)

This is coming along really well so far. Here's the ARMv7 code
generated for a sample external x_swap function that calls a_swap:

x_swap:
        mov     r3, r0
        dmb ish
.L3:
        ldrex r0, [r3]
        strex r2,r1,[r3]
        cmp     r2, #0
        bne     .L3
        dmb ish
        bx      lr

The code that's producing this is the arm atomic_arch.h (so far only
supports inline atomics for v7+):

#define a_ll a_ll
static inline int a_ll(volatile int *p)
{
	int v;
	__asm__ __volatile__ ("ldrex %0, %1" : "=r"(v) : "Q"(*p));
	return v;
}

#define a_sc a_sc
static inline int a_sc(volatile int *p, int v)
{
	int r;
	__asm__ __volatile__ ("strex %0,%1,%2" : "=r"(r) : "r"(v), "Q"(*p) : "memory");
	return !r;
}

#define a_barrier a_barrier
static inline void a_barrier()
{
	__asm__ __volatile__ ("dmb ish" : : : "memory");
}

#define a_pre_llsc a_barrier
#define a_post_llsc a_barrier

And the relevant part of the generic atomic.h:

#ifndef a_swap
#define a_swap a_swap
static inline int a_swap(volatile int *p, int v)
{
	int old;
	a_pre_llsc();
	do old = a_ll(p);
	while (!a_sc(p, v));
	a_post_llsc();
	return old;
}
#endif

The a_pre_llsc and a_post_llsc functions/macros are provided to allow
the atomic_arch.h to put the first barrier inside the ll op if needed,
and to allow different pre/post barriers if the arch has more
efficient variants that can be used.
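
For instance, a hypothetical arch whose ll instruction already had
acquire semantics could define the hooks asymmetrically (illustrative
only):

#define a_pre_llsc() ((void)0)
#define a_post_llsc a_barrier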

So far I'm really happy with how tiny atomic_arch.h is and how it can
give fully optimized versions of a_*. Of course for arm we need new
fallback code for pre-v7 versions, which will complicate things, and
sh also needs that kind of fallback. I still need to think about how
best to do these. Right now arm has suboptimal CAS-loop
implementations on pre-v7 even though v6 can do nice optimized llsc
versions. On the other hand, sh has optimized GUSA versions of all the
individual ops we have now.

I'm thinking perhaps the best solution is to have the generic llsc
implementations start off with a call to a fallback-check macro, which
would basically look like:

	if (need_fallback_a_swap) return fallback_a_swap(p, v);
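
i.e. the generic op would become something like (a sketch, reusing the
illustrative names above):

static inline int a_swap(volatile int *p, int v)
{
	int old;
	if (need_fallback_a_swap) return fallback_a_swap(p, v);
	a_pre_llsc();
	do old = a_ll(p);
	while (!a_sc(p, v));
	a_post_llsc();
	return old;
}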

This would cover pre-v6 arm, but for v6 we still want to use the llsc
code, just with a different barrier. For that, we would want a_barrier
to look like:

#define a_barrier a_barrier
static inline void a_barrier()
{
	if (v6_compat) 
		__asm__ __volatile__ ( "mcr p15,0,r0,c7,c10,5" : : : "memory");
	else
		__asm__ __volatile__ ("dmb ish" : : : "memory");
}

Or else:

#define a_barrier a_barrier
static inline void a_barrier()
{
	__asm__ __volatile__ ( "blx %0" : : "r"(barrier_func_ptr) : "memory", "lr");
}

The asm is there so we can define a custom calling convention that
doesn't clobber any registers.

Unfortunately there's a nasty snag: global objects like
need_fallback_a_swap, v6_compat, or barrier_func_ptr will be re-read
over and over in functions using atomics because the "memory" clobbers
in the asm invalidate any value the compiler may have cached.

Fortunately, there seems to be a clean solution: load them via asm
that looks like

static inline int v6_compat() {
	int r;
	__asm__ ( "..." : "=r"(r) );
	return r;
}

where the "..." is asm to perform the load. Since this asm is not
volatile and has no inputs, it can be CSE'd and treated like an
attribute-const function. Strictly speaking this doesn't prevent
reordering to the very beginning of program execution, before the
runtime atomic selection is initialized, but I don't think that's a
serious practical concern. It's certainly not a concern with dynamic
linking since nothing can be reordered back into dynamic-linker-time,
and the atomics would be initialized there. For static-linking LTO
this may require some more thought for formal correctness.
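
For arm, the "..." might be a pc-relative load through an inline
literal, something like the following sketch (the symbol name is
invented, and PIC/thumb issues are ignored here):

static inline int v6_compat()
{
	int r;
	__asm__ ( "ldr %0, 1f\n\t"	/* load address of flag */
	          "ldr %0, [%0]\n\t"	/* load the flag itself */
	          "b 2f\n"		/* skip over the literal */
	          "1:	.word __a_v6_compat\n"
	          "2:"
	          : "=r"(r) );
	return r;
}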

It should be noted that the current arm atomics fallback code is a
very neat hack that's no longer needed thanks to the dynamic linker
overhaul, so we have a lot more flexibility in redoing it.

Rich



* Re: Refactoring atomics as llsc?
  2015-05-21  4:12 ` Rich Felker
@ 2015-05-21 12:06   ` Szabolcs Nagy
  2015-05-21 12:21     ` Alexander Monakov
  2015-05-21 17:07     ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Szabolcs Nagy @ 2015-05-21 12:06 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@libc.org> [2015-05-21 00:12:37 -0400]:
> This is coming along really well so far. Here's the ARMv7 code
> generated for a sample external x_swap function that calls a_swap:
> 
> x_swap:
>         mov     r3, r0
>         dmb ish
> .L3:
>         ldrex r0, [r3]
>         strex r2,r1,[r3]
>         cmp     r2, #0
>         bne     .L3
>         dmb ish
>         bx      lr
> 
> The code that's producing this is the arm atomic_arch.h (so far only
> supports inline atomics for v7+):
> 
...
> #ifndef a_swap
> #define a_swap a_swap
> static inline int a_swap(volatile int *p, int v)
> {
> 	int old;
> 	a_pre_llsc();
> 	do old = a_ll(p);
> 	while (!a_sc(p, v));
> 	a_post_llsc();
> 	return old;
> }
> #endif
> 

nice

> Unfortunately there's a nasty snag: global objects like
> need_fallback_a_swap, v6_compat, or barrier_func_ptr will be re-read
> over and over in functions using atomics because the "memory" clobbers
> in the asm invalidate any value the compiler may have cached.
> 
> Fortunately, there seems to be a clean solution: load them via asm
> that looks like
> 
> static inline int v6_compat() {
> 	int r;
> 	__asm__ ( "..." : "=r"(r) );
> 	return r;
> }
> 
> where the "..." is asm to perform the load. Since this asm is not
> volatile and has no inputs, it can be CSE'd and treated like an
> attribute-const function. Strictly speaking this doesn't prevent
> reordering to the very beginning of program execution, before the
> runtime atomic selection is initialized, but I don't think that's a
> serious practical concern. It's certainly not a concern with dynamic
> linking since nothing can be reordered back into dynamic-linker-time,
> and the atomics would be initialized there. For static-linking LTO
> this may require some more thought for formal correctness.

does gcc cse that?

why is it guaranteed that r will always be the same?

(and how can gcc know the cost of the asm? it seems to
me that would be needed to determine if it's worth keeping
r in a reg or just rerunning the asm every time)



* Re: Refactoring atomics as llsc?
  2015-05-21 12:06   ` Szabolcs Nagy
@ 2015-05-21 12:21     ` Alexander Monakov
  2015-05-21 17:08       ` Rich Felker
  2015-05-21 17:07     ` Rich Felker
  1 sibling, 1 reply; 9+ messages in thread
From: Alexander Monakov @ 2015-05-21 12:21 UTC (permalink / raw)
  To: musl

On Thu, 21 May 2015, Szabolcs Nagy wrote:
> > Fortunately, there seems to be a clean solution: load them via asm
> > that looks like
> > 
> > static inline int v6_compat() {
> > 	int r;
> > 	__asm__ ( "..." : "=r"(r) );
> > 	return r;
> > }
> > 
> > where the "..." is asm to perform the load. Since this asm is not
> > volatile and has no inputs, it can be CSE'd and treated like an
> > attribute-const function. Strictly speaking this doesn't prevent
> > reordering to the very beginning of program execution, before the
> > runtime atomic selection is initialized, but I don't think that's a
> > serious practical concern. It's certainly not a concern with dynamic
> > linking since nothing can be reordered back into dynamic-linker-time,
> > and the atomics would be initialized there. For static-linking LTO
> > this may require some more thought for formal correctness.
> 
> does gcc cse that?
> 
> why is it guaranteed that r will always be the same?

The asm is not volatile, so the compiler can use its constraints to move it
like any other instruction.  In this case there are no inputs, a single
output register, and no clobbers.

> (and how can gcc know the cost of the asm? it seems to
> me that would be needed to determine if it's worth keeping
> r in a reg or just rerunning the asm every time)

While obviously any sort of exact cost cannot be known, GCC uses the line
count of the asm, iirc, as an estimate of the number of instructions.

Alexander



* Re: Refactoring atomics as llsc?
  2015-05-21 12:06   ` Szabolcs Nagy
  2015-05-21 12:21     ` Alexander Monakov
@ 2015-05-21 17:07     ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2015-05-21 17:07 UTC (permalink / raw)
  To: musl

On Thu, May 21, 2015 at 02:06:12PM +0200, Szabolcs Nagy wrote:
> > Unfortunately there's a nasty snag: global objects like
> > need_fallback_a_swap, v6_compat, or barrier_func_ptr will be re-read
> > over and over in functions using atomics because the "memory" clobbers
> > in the asm invalidate any value the compiler may have cached.
> > 
> > Fortunately, there seems to be a clean solution: load them via asm
> > that looks like
> > 
> > static inline int v6_compat() {
> > 	int r;
> > 	__asm__ ( "..." : "=r"(r) );
> > 	return r;
> > }
> > 
> > where the "..." is asm to perform the load. Since this asm is not
> > volatile and has no inputs, it can be CSE'd and treated like an
> > attribute-const function. Strictly speaking this doesn't prevent
> > reordering to the very beginning of program execution, before the
> > runtime atomic selection is initialized, but I don't think that's a
> > serious practical concern. It's certainly not a concern with dynamic
> > linking since nothing can be reordered back into dynamic-linker-time,
> > and the atomics would be initialized there. For static-linking LTO
> > this may require some more thought for formal correctness.
> 
> does gcc cse that?
> 
> why is it guaranteed that r will always be the same?

By default, asm is pure. This is so you can write things like fancy
arithmetic operations in asm, and in a sense it's the whole point of
having fine-grained input and output constraints. Use of asm volatile,
or of input constraints referring to memory ("m" type, not just "r"
operands that happen to contain a memory address) that may have been
clobbered, makes it non-pure.
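
For example, a non-volatile asm like the following (illustrative only;
arm syntax) behaves like an attribute-const function, so two calls
with the same arguments can be CSE'd into a single add:

static inline int asm_add(int a, int b)
{
	int r;
	__asm__ ( "add %0, %1, %2" : "=r"(r) : "r"(a), "r"(b) );
	return r;
}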

> (and how can gcc know the cost of the asm? it seems to
> me that would be needed to determine if it's worth keeping
> r in a reg or just rerunning the asm every time)

I think it has some basic heuristics, but it won't necessarily keep it
in a register anyway; it might spill it to the stack. That's reloadable
in one instruction and should stay in cache, so it "should" always be
at least as fast as, or faster than, rerunning the asm.

Rich



* Re: Refactoring atomics as llsc?
  2015-05-21 12:21     ` Alexander Monakov
@ 2015-05-21 17:08       ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2015-05-21 17:08 UTC (permalink / raw)
  To: musl

On Thu, May 21, 2015 at 03:21:21PM +0300, Alexander Monakov wrote:
> On Thu, 21 May 2015, Szabolcs Nagy wrote:
> > > Fortunately, there seems to be a clean solution: load them via asm
> > > that looks like
> > > 
> > > static inline int v6_compat() {
> > > 	int r;
> > > 	__asm__ ( "..." : "=r"(r) );
> > > 	return r;
> > > }
> > > 
> > > where the "..." is asm to perform the load. Since this asm is not
> > > volatile and has no inputs, it can be CSE'd and treated like an
> > > attribute-const function. Strictly speaking this doesn't prevent
> > > reordering to the very beginning of program execution, before the
> > > runtime atomic selection is initialized, but I don't think that's a
> > > serious practical concern. It's certainly not a concern with dynamic
> > > linking since nothing can be reordered back into dynamic-linker-time,
> > > and the atomics would be initialized there. For static-linking LTO
> > > this may require some more thought for formal correctness.
> > 
> > does gcc cse that?
> > 
> > why is it guaranteed that r will always be the same?
> 
> The asm is not volatile, so the compiler can use its constraints to move it
> like any other instruction.  In this case there are no inputs, a single
> output register, and no clobbers.
> 
> > (and how can gcc know the cost of the asm? it seems to
> > me that would be needed to determine if it's worth keeping
> > r in a reg or just rerunning the asm every time)
> 
> While obviously any sort of exact cost cannot be known, GCC uses the line
> count of the asm, iirc, as an estimate of the number of instructions.

Interesting. So can you use ; instead of \n for semantic purposes
controlling GCC's decision making? ;-)

Rich


