atomic.h cleanup

mailing list of musl libc
 help / color / mirror / code / Atom feed

* atomic.h cleanup
@ 2016-01-10 12:21 Markus Wichmann
  2016-01-10 16:57 ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-01-10 12:21 UTC (permalink / raw)
  To: musl

Hi all,

The development roadmap on the musl wiki lists the ominous point
"atomic.h cleanup" for 1.2.0.

I assume you mean a sort of simplification and unification. I noticed
that for the RISC arch's there are rather liberal amounts of inline
assembly for the atomic operations. And I have always been taught, that
as soon as you start copying code, you are probably doing it wrong.

So first thing I'd do: add a new file, let's call it atomic_debruijn.h.
It contains an implementation of a_ctz() and a_ctz_64() based on the
DeBruijn number. That way, all the architectures currently implementing
a_ctz() in this manner can just include that file, and a lot of
duplicate code goes out the window.

Second thing: We can reduce the inline assembly footprint and the amount
of duplicate code by adding a new file, let's call it atomic_llsc.h,
that implements a_cas(), a_cas_p(), a_swap(), a_fetch_add(), a_inc(),
a_dec(), a_and() and a_or() in terms of new functions that would have to
be defined, namely:

static inline void a_presync(void) - execute any barrier needed before
attempting an atomic operation, like "dmb ish" for arm, or "sync" for
ppc.

static inline void a_postsync(void) - execute any barrier needed
afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.

static inline int a_ll(int*) - perform an LL on the given pointer and
return the value there. This would be "lwarx" for PPC, or "ldrex" for
ARM.

static inline int a_sc(int*, int) - perform an SC on the given pointer
with the given value. Return zero iff that failed.

static inline void* a_ll_p(void*) - same as a_ll(), but with machine
words instead of int, if that's a difference.

static inline int a_sc_p(void*, void*) - same as a_sc(), but with
machine words.

With these function we can implement e.g. CAS as:

static inline int a_cas(volatile int *p, int t, int s)
{
    int v;
    do {
        v = a_ll(p);
        if (v != t)
            break;
    } while (!a_sc(p, s));
    return v;
}

Add some #ifdefs to only activate the pointer variations if they're
needed (i.e. if we're on 64 bits) and Bob's your uncle.

The only hardship would be in implementing a_sc(), but that can be
solved by using a feature often referenced but rarely seen in the wild:
ASM goto. How that works is that, if the arch's SC instruction returns
success or failure in a flag and the CPU can jump on that flag (unlike,
say, microblaze, which can only jump on comparisons), then you encode
the jump in the assembly snippet but let the compiler handle the targets
for you. Since in all cases, we want to jump on failure, that's what the
assembly should do, so for instance for PowerPC:

static inline int a_sc(volatile int* p, int x)
{
    __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
    return 1;
fail:
    return 0;
}

I already tried the compiler results for such a design, but I never
tried running it for lack of hardware.

Anyway, this code makes it possible for the compiler to redirect the
conditional jump on failure to the top of the loop in a_cas(). Since the
return value isn't used otherwise, the values 1 and 0 never appear in
the generated assembly.

What do you say to this design?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 12:21 atomic.h cleanup Markus Wichmann
@ 2016-01-10 16:57 ` Rich Felker
  2016-01-10 17:35   ` Markus Wichmann
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Rich Felker @ 2016-01-10 16:57 UTC (permalink / raw)
  To: musl

On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
> Hi all,
> 
> The development roadmap on the musl wiki lists the ominous point
> "atomic.h cleanup" for 1.2.0.
> 
> I assume you mean a sort of simplification and unification. I noticed
> that for the RISC arch's there are rather liberal amounts of inline
> assembly for the atomic operations. And I have always been taught, that
> as soon as you start copying code, you are probably doing it wrong.
> 
> So first thing I'd do: add a new file, let's call it atomic_debruijn.h.
> It contains an implementation of a_ctz() and a_ctz_64() based on the
> DeBruijn number. That way, all the architectures currently implementing
> a_ctz() in this manner can just include that file, and a lot of
> duplicate code goes out the window.
> 
> Second thing: We can reduce the inline assembly footprint and the amount
> of duplicate code by adding a new file, let's call it atomic_llsc.h,
> that implements a_cas(), a_cas_p(), a_swap(), a_fetch_add(), a_inc(),
> a_dec(), a_and() and a_or() in terms of new functions that would have to
> be defined, namely:
> 
> static inline void a_presync(void) - execute any barrier needed before
> attempting an atomic operation, like "dmb ish" for arm, or "sync" for
> ppc.
> 
> static inline void a_postsync(void) - execute any barrier needed
> afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.
> 
> static inline int a_ll(int*) - perform an LL on the given pointer and
> return the value there. This would be "lwarx" for PPC, or "ldrex" for
> ARM.
> 
> static inline int a_sc(int*, int) - perform an SC on the given pointer
> with the given value. Return zero iff that failed.
> 
> static inline void* a_ll_p(void*) - same as a_ll(), but with machine
> words instead of int, if that's a difference.
> 
> static inline int a_sc_p(void*, void*) - same as a_sc(), but with
> machine words.
> 
> 
> With these function we can implement e.g. CAS as:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
>     int v;
>     do {
>         v = a_ll(p);
>         if (v != t)
>             break;
>     } while (!a_sc(p, s));
>     return v;
> }
> 
> Add some #ifdefs to only activate the pointer variations if they're
> needed (i.e. if we're on 64 bits) and Bob's your uncle.
> 
> The only hardship would be in implementing a_sc(), but that can be
> solved by using a feature often referenced but rarely seen in the wild:
> ASM goto. How that works is that, if the arch's SC instruction returns
> success or failure in a flag and the CPU can jump on that flag (unlike,
> say, microblaze, which can only jump on comparisons), then you encode
> the jump in the assembly snippet but let the compiler handle the targets
> for you. Since in all cases, we want to jump on failure, that's what the
> assembly should do, so for instance for PowerPC:
> 
> static inline int a_sc(volatile int* p, int x)
> {
>     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
>     return 1;
> fail:
>     return 0;
> }
> 
> I already tried the compiler results for such a design, but I never
> tried running it for lack of hardware.
> 
> Anyway, this code makes it possible for the compiler to redirect the
> conditional jump on failure to the top of the loop in a_cas(). Since the
> return value isn't used otherwise, the values 1 and 0 never appear in
> the generated assembly.
> 
> What do you say to this design?

Have you read this thread? :)

http://www.openwall.com/lists/musl/2015/05/20/1

I thought at one point it was linked from the wiki but maybe it got
lost.

Basically I have this done already outside of musl as an experiment,
but there are minor details that were holding it up. One annoyance is
that, on some archs, success/failure of "sc" comes via a condition
flag which the C caller can't easily branch on, so there's an extra
conversion to a boolean result inside the asm and extra conversion
back to a test/branch outside the asm. In practice we probably don't
care.

One other issue is that risc-v seems to guarantee, at least on some
implementations, stronger forward-progress guarantees than a normal
ll/sc as long as the ll/sc are in order, within a few instruction
slots of each other, with no branches between. Such conditions cannot
be met without putting them in the same asm block, so we might need to
do a custom version for risc-v if we want to take advantage of the
stronger properties.

Anyway, at this point the main obstacle to finishing the task is doing
the actual merging and testing, not any new coding, I think.

Rich


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 16:57 ` Rich Felker
@ 2016-01-10 17:35   ` Markus Wichmann
  2016-01-10 17:50     ` Alexander Monakov
  2016-01-10 17:37   ` Markus Wichmann
  2016-01-22  0:09   ` Rich Felker
  2 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-01-10 17:35 UTC (permalink / raw)
  To: musl

On Sun, Jan 10, 2016 at 11:57:18AM -0500, Rich Felker wrote:
> On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
>> [...]
>> What do you say to this design?
> 
> Have you read this thread? :)
> 
> http://www.openwall.com/lists/musl/2015/05/20/1
> 
> I thought at one point it was linked from the wiki but maybe it got
> lost.
> 

Well, at least it's not on the Roadmap page. There should probably be a
wiki page for such open questions. Then people like me wouldn't have to ask
(or do extended searches in the mailing list archives, which at the
moment is only possible through your favorite search engine, BTW, which
is why I've not been doing that so often).

> Basically I have this done already outside of musl as an experiment,
> but there are minor details that were holding it up. One annoyance is
> that, on some archs, success/failure of "sc" comes via a condition
> flag which the C caller can't easily branch on, so there's an extra
> conversion to a boolean result inside the asm and extra conversion
> back to a test/branch outside the asm. In practice we probably don't
> care.
> 

Yes, and my original message showed how to deal with that:

| static inline int a_sc(volatile int* p, int x)
| {
|     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
|     return 1;
| fail:
|     return 0;
| }

I tested the assembler output from both gcc and clang and it looks
alright to me (testcase attached).

As I said, we typically want to branch if SC fails, so that's how these
snippets should be written (not "branch if it succeeds", because AFAICS
the compiler can't rewrite the ASM snippets).

> One other issue is that risc-v seems to guarantee, at least on some
> implementations, stronger forward-progress guarantees than a normal
> ll/sc as long as the ll/sc are in order, within a few instruction
> slots of each other, with no branches between. Such conditions cannot
> be met without putting them in the same asm block, so we might need to
> do a custom version for risc-v if we want to take advantage of the
> stronger properties.
> 

Oh goodie, another term to research...

As far as I can see, RISC-V is another architecture. Since the scheme
I'm proposing is strictly opt-in, we can cross that bridge once we come
to it. For instance, this scheme is insufficient to support SuperH and
its three different versions of a_cas(). So I'd just leave that be for
the moment.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 16:57 ` Rich Felker
  2016-01-10 17:35   ` Markus Wichmann
@ 2016-01-10 17:37   ` Markus Wichmann
  2016-01-22  0:09   ` Rich Felker
  2 siblings, 0 replies; 14+ messages in thread
From: Markus Wichmann @ 2016-01-10 17:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 79 bytes --]

Hi all,

And once more with the promised attachments. I'm sorry.

Ciao,
Markus

[-- Attachment #2: cas.c --]
[-- Type: text/x-csrc, Size: 477 bytes --]

static inline int a_ll(volatile int *p) {
    int x;
    asm volatile ("lwarx %0, 0, %1" : "=r"(x) : "r"(p) : "memory");
    return x;
}

static inline int a_sc(volatile int *p, int x) {
    asm goto ("stwcx. %0, 0, %1\nbne- %l2" : : "r"(x), "r"(p) : "cc","memory" : fail);
    return 1;
fail:
    return 0;
}

int a_cas(volatile int *p, int t, int s) {
    int x;
    do {
        x = a_ll(p);
        if (x != t)
            break;
    } while (!a_sc(p, s));
    return x;
}

[-- Attachment #3: cas.s --]
[-- Type: text/plain, Size: 359 bytes --]

	.file	"cas.c"
	.section	".text"
	.align 2
	.globl a_cas
	.type	a_cas, @function
a_cas:
.L2:
#APP
 # 3 "cas.c" 1
	lwarx 9, 0, 3
 # 0 "" 2
#NO_APP
	cmpw 7,4,9
	bne- 7,.L4
#APP
 # 8 "cas.c" 1
	stwcx. 5, 0, 3
bne- .L2
 # 0 "" 2
#NO_APP
.L3:
	mr 3,4
	blr
.L4:
	mr 4,9
	b .L3
	.size	a_cas, .-a_cas
	.ident	"GCC: (GNU) 4.9.2"
	.section	.note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 17:35   ` Markus Wichmann
@ 2016-01-10 17:50     ` Alexander Monakov
  2016-01-11 16:35       ` Markus Wichmann
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Monakov @ 2016-01-10 17:50 UTC (permalink / raw)
  To: musl

On Sun, 10 Jan 2016, Markus Wichmann wrote:
> | static inline int a_sc(volatile int* p, int x)
> | {
> |     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
> |     return 1;
> | fail:
> |     return 0;
> | }
> 
> I tested the assembler output from both gcc and clang and it looks
> alright to me (testcase attached).

"asm goto" is an extension that appeared in gcc 4.5 and is not supported in
clang (I get "error: expected '(' after 'asm'" with clang 3.7).  I don't know
why you claim it works with clang.

Corresponding llvm bugs (no plans to add support soon):
https://llvm.org/bugs/show_bug.cgi?id=9295
https://llvm.org/bugs/show_bug.cgi?id=14406

Alexander


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 17:50     ` Alexander Monakov
@ 2016-01-11 16:35       ` Markus Wichmann
  2016-01-11 17:12         ` Jens Gustedt
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-01-11 16:35 UTC (permalink / raw)
  To: musl

On Sun, Jan 10, 2016 at 08:50:28PM +0300, Alexander Monakov wrote:
> "asm goto" is an extension that appeared in gcc 4.5 and is not supported in
> clang (I get "error: expected '(' after 'asm'" with clang 3.7).  I don't know
> why you claim it works with clang.
> 

That's very simple: Because I misremembered. I had tested a lot of
things against gcc and clang and thought that file had been among them.
But it wasn't.

And I didn't test it because building a cross-gcc had been hard enough,
so I didn't have the energy to try a cross-clang on that day.

So I'm sorry, but it was an honest mistake.

> Corresponding llvm bugs (no plans to add support soon):
> https://llvm.org/bugs/show_bug.cgi?id=9295
> https://llvm.org/bugs/show_bug.cgi?id=14406
> 

Really? OK, so it's either suboptimal code for everyone or
compiler-specific better code. Why can't we have nice things?

OTOH, maybe we simply shouldn't write synchronisation primitives
ourselves and instead use the ones provided by GCC (and let other
compilers suck on a salty sausage, if they don't support those
primitives).

Ciao,
Markus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-11 16:35       ` Markus Wichmann
@ 2016-01-11 17:12         ` Jens Gustedt
  2016-01-11 19:03           ` Szabolcs Nagy
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Gustedt @ 2016-01-11 17:12 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1513 bytes --]

Hello,

Am Montag, den 11.01.2016, 17:35 +0100 schrieb Markus Wichmann:
> Really? OK, so it's either suboptimal code for everyone or
> compiler-specific better code. Why can't we have nice things?

nice things would be portable atomics, no?

> OTOH, maybe we simply shouldn't write synchronisation primitives
> ourselves and instead use the ones provided by GCC (and let other
> compilers suck on a salty sausage, if they don't support those
> primitives).

I think on the long run we should use C11 atomics and leave the dirty
work to the compiler writers. To my experience they do good work with
that now, the assembler they produce looks nice.

My stdatomic library is sitting there, ready to integrate into
musl. It solves the problem of backwards compatibility for all
compilers that that implement the __sync builtins. (gcc and clang with
very old version numbers.)

Last time I looked, all usages but one of atomic operations in musl
are clean. If an atomic operation is used for a data a some point,
atomic operations are used in all other places. So moving to
_Atomic(int) would be a option. (Basically this would be `volatile
int*` => `_Atomic(int)`, IIRC).

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-11 17:12         ` Jens Gustedt
@ 2016-01-11 19:03           ` Szabolcs Nagy
  2016-01-11 20:56             ` Jens Gustedt
  0 siblings, 1 reply; 14+ messages in thread
From: Szabolcs Nagy @ 2016-01-11 19:03 UTC (permalink / raw)
  To: musl

* Jens Gustedt <jens.gustedt@inria.fr> [2016-01-11 18:12:29 +0100]:
> Am Montag, den 11.01.2016, 17:35 +0100 schrieb Markus Wichmann:
> > OTOH, maybe we simply shouldn't write synchronisation primitives
> > ourselves and instead use the ones provided by GCC (and let other
> > compilers suck on a salty sausage, if they don't support those
> > primitives).
> 
> I think on the long run we should use C11 atomics and leave the dirty
> work to the compiler writers. To my experience they do good work with
> that now, the assembler they produce looks nice.
> 

yes but old compilers had various bugs on various targets.

> My stdatomic library is sitting there, ready to integrate into
> musl. It solves the problem of backwards compatibility for all
> compilers that that implement the __sync builtins. (gcc and clang with
> very old version numbers.)
> 

i think simpler compilers like pcc, cparser, tcc
dont implement that.

if musl moves to compiler builtins then i'd
like to have a possibility to compile atomic
primitives as a separate tu

> Last time I looked, all usages but one of atomic operations in musl
> are clean. If an atomic operation is used for a data a some point,
> atomic operations are used in all other places. So moving to
> _Atomic(int) would be a option. (Basically this would be `volatile
> int*` => `_Atomic(int)`, IIRC).
> 

pthread_once_t and pthread_spinlock_t are
publicly visibles type (without volatile and
_Atomic)

i dont think we can fix those without abi
change.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-11 19:03           ` Szabolcs Nagy
@ 2016-01-11 20:56             ` Jens Gustedt
  2016-01-14 22:12               ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Gustedt @ 2016-01-11 20:56 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2846 bytes --]

Am Montag, den 11.01.2016, 20:03 +0100 schrieb Szabolcs Nagy:
> * Jens Gustedt <jens.gustedt@inria.fr> [2016-01-11 18:12:29 +0100]:
> > Am Montag, den 11.01.2016, 17:35 +0100 schrieb Markus Wichmann:
> > > OTOH, maybe we simply shouldn't write synchronisation primitives
> > > ourselves and instead use the ones provided by GCC (and let other
> > > compilers suck on a salty sausage, if they don't support those
> > > primitives).
> > 
> > I think on the long run we should use C11 atomics and leave the dirty
> > work to the compiler writers. To my experience they do good work with
> > that now, the assembler they produce looks nice.
> > 
> 
> yes but old compilers had various bugs on various targets.
> 
> > My stdatomic library is sitting there, ready to integrate into
> > musl. It solves the problem of backwards compatibility for all
> > compilers that that implement the __sync builtins. (gcc and clang with
> > very old version numbers.)
> > 
> 
> i think simpler compilers like pcc, cparser, tcc
> dont implement that.
> 
> if musl moves to compiler builtins then i'd
> like to have a possibility to compile atomic
> primitives as a separate tu

In a sense, stdatomic has that already. It also implements the atomic
operations as fallback functions, for the case that the compiler isn't
able to synthesise the operation.

But you are right, support for those simpler compilers then would mean
that we'd have to maintain stubs, at least for the most commonly used
4 byte operations.

> > Last time I looked, all usages but one of atomic operations in musl
> > are clean. If an atomic operation is used for a data a some point,
> > atomic operations are used in all other places. So moving to
> > _Atomic(int) would be a option. (Basically this would be `volatile
> > int*` => `_Atomic(int)`, IIRC).

oops I meant `volatile int*` => `_Atomic(int)*`

> pthread_once_t and pthread_spinlock_t are
> publicly visibles type (without volatile and
> _Atomic)
> 
> i dont think we can fix those without abi
> change.

This is really a question what ABI means in this case. The width,
alignment and representation of the `int` would stay the same, we
would just internally (to the library implementation) interpret it as
_Atomic(int).

Also it seems that we do such a re-interpretation already with
`volatile`. One interpretation of the standard says that the object
itself has to be `volatile`, just casting a pointer to `volatile int*`
doesn't inhibit optimizations.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-11 20:56             ` Jens Gustedt
@ 2016-01-14 22:12               ` Rich Felker
  2016-01-14 22:37                 ` Jens Gustedt
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-01-14 22:12 UTC (permalink / raw)
  To: musl

On Mon, Jan 11, 2016 at 09:56:50PM +0100, Jens Gustedt wrote:
> Am Montag, den 11.01.2016, 20:03 +0100 schrieb Szabolcs Nagy:
> > * Jens Gustedt <jens.gustedt@inria.fr> [2016-01-11 18:12:29 +0100]:
> > > Am Montag, den 11.01.2016, 17:35 +0100 schrieb Markus Wichmann:
> > > > OTOH, maybe we simply shouldn't write synchronisation primitives
> > > > ourselves and instead use the ones provided by GCC (and let other
> > > > compilers suck on a salty sausage, if they don't support those
> > > > primitives).
> > > 
> > > I think on the long run we should use C11 atomics and leave the dirty
> > > work to the compiler writers. To my experience they do good work with
> > > that now, the assembler they produce looks nice.
> > > 
> > 
> > yes but old compilers had various bugs on various targets.
> > 
> > > My stdatomic library is sitting there, ready to integrate into
> > > musl. It solves the problem of backwards compatibility for all
> > > compilers that that implement the __sync builtins. (gcc and clang with
> > > very old version numbers.)
> > > 
> > 
> > i think simpler compilers like pcc, cparser, tcc
> > dont implement that.
> > 
> > if musl moves to compiler builtins then i'd
> > like to have a possibility to compile atomic
> > primitives as a separate tu
> 
> In a sense, stdatomic has that already. It also implements the atomic
> operations as fallback functions, for the case that the compiler isn't
> able to synthesise the operation.
> 
> But you are right, support for those simpler compilers then would mean
> that we'd have to maintain stubs, at least for the most commonly used
> 4 byte operations.

There are already multiple reasons we don't use the compiler's
atomics, either directly or indirectly via stdatomic.h. They're not
supported in some old/alternative compilers, they generate highly
suboptimal code even on modern compilers for some important archs
(e.g. ARM), and they fail to properly support archs where it's
necessary to make a runtime choice of which atomic code paths to use
in order to achieve safe/correct behavior.

With the atomics overhaul I am planning to have an option (selected by
the arch headers, not the user) to use __sync_* as the backend for
atomics, which will ease porting to new archs where it already works
correctly on all compilers that support the arch.

> > > Last time I looked, all usages but one of atomic operations in musl
> > > are clean. If an atomic operation is used for a data a some point,
> > > atomic operations are used in all other places. So moving to
> > > _Atomic(int) would be a option. (Basically this would be `volatile
> > > int*` => `_Atomic(int)`, IIRC).
> 
> oops I meant `volatile int*` => `_Atomic(int)*`
> 
> > pthread_once_t and pthread_spinlock_t are
> > publicly visibles type (without volatile and
> > _Atomic)
> > 
> > i dont think we can fix those without abi
> > change.
> 
> This is really a question what ABI means in this case. The width,
> alignment and representation of the `int` would stay the same, we
> would just internally (to the library implementation) interpret it as
> _Atomic(int).

From a C++ perspective ABI certainly includes the type that will
appear in mangled function names. This is the main motivation for not
changing types like this. Of course LTO could also break when formal
types don't match.

> Also it seems that we do such a re-interpretation already with
> `volatile`. One interpretation of the standard says that the object
> itself has to be `volatile`, just casting a pointer to `volatile int*`
> doesn't inhibit optimizations.

GCC explicitly interprets it the other way, and documents it as such.
If we want to deal with compilers that don't provide such a guarantee,
and for which the accesses could break, we probably need an explicit
load construct in asm... But this seems like a low priorit and I'm
happy to wait to address it unless/until real-world problems seem
likely. For this reason it would be nice to document the assumption,
however, as Szabolcs Nagy suggested we do for issues like this.

Rich

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-14 22:12               ` Rich Felker
@ 2016-01-14 22:37                 ` Jens Gustedt
  2016-01-14 23:32                   ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Gustedt @ 2016-01-14 22:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2164 bytes --]

Am Donnerstag, den 14.01.2016, 17:12 -0500 schrieb Rich Felker:
> There are already multiple reasons we don't use the compiler's
> atomics, either directly or indirectly via stdatomic.h. They're not
> supported in some old/alternative compilers, they generate highly
> suboptimal code even on modern compilers for some important archs
> (e.g. ARM),

I have seen some pretty good assembler when using the
__atomic... builtins, so I can't completely follow, here.

> and they fail to properly support archs where it's
> necessary to make a runtime choice of which atomic code paths to use
> in order to achieve safe/correct behavior.

ok

> > > > Last time I looked, all usages but one of atomic operations in musl
> > > > are clean. If an atomic operation is used for a data a some point,
> > > > atomic operations are used in all other places. So moving to
> > > > _Atomic(int) would be a option. (Basically this would be `volatile
> > > > int*` => `_Atomic(int)`, IIRC).
> > 
> > oops I meant `volatile int*` => `_Atomic(int)*`
> > 
> > > pthread_once_t and pthread_spinlock_t are
> > > publicly visibles type (without volatile and
> > > _Atomic)
> > > 
> > > i dont think we can fix those without abi
> > > change.
> > 
> > This is really a question what ABI means in this case. The width,
> > alignment and representation of the `int` would stay the same, we
> > would just internally (to the library implementation) interpret it as
> > _Atomic(int).
> 
> From a C++ perspective ABI certainly includes the type that will
> appear in mangled function names. This is the main motivation for not
> changing types like this. Of course LTO could also break when formal
> types don't match.

Yes, I know. I was thinking of a way to have the externally visible
type remain stable, but internally use _Atomic.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-14 22:37                 ` Jens Gustedt
@ 2016-01-14 23:32                   ` Rich Felker
  2016-01-15  0:46                     ` Szabolcs Nagy
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-01-14 23:32 UTC (permalink / raw)
  To: musl

On Thu, Jan 14, 2016 at 11:37:48PM +0100, Jens Gustedt wrote:
> Am Donnerstag, den 14.01.2016, 17:12 -0500 schrieb Rich Felker:
> > There are already multiple reasons we don't use the compiler's
> > atomics, either directly or indirectly via stdatomic.h. They're not
> > supported in some old/alternative compilers, they generate highly
> > suboptimal code even on modern compilers for some important archs
> > (e.g. ARM),
> 
> I have seen some pretty good assembler when using the
> __atomic... builtins, so I can't completely follow, here.

It generates "dmb sy" all over the place instead of "dmb ish".
Synchronizing with external bus devices is NOT something you want to
happen in thread synchronization primitives.

Rich


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-14 23:32                   ` Rich Felker
@ 2016-01-15  0:46                     ` Szabolcs Nagy
  0 siblings, 0 replies; 14+ messages in thread
From: Szabolcs Nagy @ 2016-01-15  0:46 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@libc.org> [2016-01-14 18:32:24 -0500]:
> On Thu, Jan 14, 2016 at 11:37:48PM +0100, Jens Gustedt wrote:
> > Am Donnerstag, den 14.01.2016, 17:12 -0500 schrieb Rich Felker:
> > > There are already multiple reasons we don't use the compiler's
> > > atomics, either directly or indirectly via stdatomic.h. They're not
> > > supported in some old/alternative compilers, they generate highly
> > > suboptimal code even on modern compilers for some important archs
> > > (e.g. ARM),
> > 
> > I have seen some pretty good assembler when using the
> > __atomic... builtins, so I can't completely follow, here.
> 
> It generates "dmb sy" all over the place instead of "dmb ish".
> Synchronizing with external bus devices is NOT something you want to
> happen in thread synchronization primitives.
> 

fixed in gcc-6 :)
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=224317

(not backported to earlier branches though)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: atomic.h cleanup
  2016-01-10 16:57 ` Rich Felker
  2016-01-10 17:35   ` Markus Wichmann
  2016-01-10 17:37   ` Markus Wichmann
@ 2016-01-22  0:09   ` Rich Felker
  2 siblings, 0 replies; 14+ messages in thread
From: Rich Felker @ 2016-01-22  0:09 UTC (permalink / raw)
  To: musl

On Sun, Jan 10, 2016 at 11:57:18AM -0500, Rich Felker wrote:
> On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
> > Hi all,
> > 
> > The development roadmap on the musl wiki lists the ominous point
> > "atomic.h cleanup" for 1.2.0.
> > 
> > I assume you mean a sort of simplification and unification. I noticed
> > that for the RISC arch's there are rather liberal amounts of inline
> > assembly for the atomic operations. And I have always been taught, that
> > as soon as you start copying code, you are probably doing it wrong.
> > 
> > So first thing I'd do: add a new file, let's call it atomic_debruijn.h.
> > It contains an implementation of a_ctz() and a_ctz_64() based on the
> > DeBruijn number. That way, all the architectures currently implementing
> > a_ctz() in this manner can just include that file, and a lot of
> > duplicate code goes out the window.
> > 
> > Second thing: We can reduce the inline assembly footprint and the amount
> > of duplicate code by adding a new file, let's call it atomic_llsc.h,
> > that implements a_cas(), a_cas_p(), a_swap(), a_fetch_add(), a_inc(),
> > a_dec(), a_and() and a_or() in terms of new functions that would have to
> > be defined, namely:
> > 
> > static inline void a_presync(void) - execute any barrier needed before
> > attempting an atomic operation, like "dmb ish" for arm, or "sync" for
> > ppc.
> > 
> > static inline void a_postsync(void) - execute any barrier needed
> > afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.
> > 
> > static inline int a_ll(int*) - perform an LL on the given pointer and
> > return the value there. This would be "lwarx" for PPC, or "ldrex" for
> > ARM.
> > 
> > static inline int a_sc(int*, int) - perform an SC on the given pointer
> > with the given value. Return zero iff that failed.
> > 
> > static inline void* a_ll_p(void*) - same as a_ll(), but with machine
> > words instead of int, if that's a difference.
> > 
> > static inline int a_sc_p(void*, void*) - same as a_sc(), but with
> > machine words.
> > 
> > 
> > With these function we can implement e.g. CAS as:
> > 
> > static inline int a_cas(volatile int *p, int t, int s)
> > {
> >     int v;
> >     do {
> >         v = a_ll(p);
> >         if (v != t)
> >             break;
> >     } while (!a_sc(p, s));
> >     return v;
> > }
> > 
> > Add some #ifdefs to only activate the pointer variations if they're
> > needed (i.e. if we're on 64 bits) and Bob's your uncle.
> > 
> > The only hardship would be in implementing a_sc(), but that can be
> > solved by using a feature often referenced but rarely seen in the wild:
> > ASM goto. How that works is that, if the arch's SC instruction returns
> > success or failure in a flag and the CPU can jump on that flag (unlike,
> > say, microblaze, which can only jump on comparisons), then you encode
> > the jump in the assembly snippet but let the compiler handle the targets
> > for you. Since in all cases, we want to jump on failure, that's what the
> > assembly should do, so for instance for PowerPC:
> > 
> > static inline int a_sc(volatile int* p, int x)
> > {
> >     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
> >     return 1;
> > fail:
> >     return 0;
> > }
> > 
> > I already tried the compiler results for such a design, but I never
> > tried running it for lack of hardware.
> > 
> > Anyway, this code makes it possible for the compiler to redirect the
> > conditional jump on failure to the top of the loop in a_cas(). Since the
> > return value isn't used otherwise, the values 1 and 0 never appear in
> > the generated assembly.
> > 
> > What do you say to this design?
> 
> Have you read this thread? :)
> 
> http://www.openwall.com/lists/musl/2015/05/20/1
> 
> I thought at one point it was linked from the wiki but maybe it got
> lost.
> 
> Basically I have this done already outside of musl as an experiment,
> but there are minor details that were holding it up. One annoyance is
> that, on some archs, success/failure of "sc" comes via a condition
> flag which the C caller can't easily branch on, so there's an extra
> conversion to a boolean result inside the asm and extra conversion
> back to a test/branch outside the asm. In practice we probably don't
> care.
> 
> One other issue is that risc-v seems to guarantee, at least on some
> implementations, stronger forward-progress guarantees than a normal
> ll/sc as long as the ll/sc are in order, within a few instruction
> slots of each other, with no branches between. Such conditions cannot
> be met without putting them in the same asm block, so we might need to
> do a custom version for risc-v if we want to take advantage of the
> stronger properties.
> 
> Anyway, at this point the main obstacle to finishing the task is doing
> the actual merging and testing, not any new coding, I think.

Most of this is committed now!

The original commit introducing the new system is:

http://git.musl-libc.org/cgit/musl/commit/?id=1315596b510189b5159e742110b504177bdd4932

Subsequent commits are converting archs over one by one to make best
use of the new system.

I'd really like to see some before-and-after benchmarks on real
hardware. For Timo's cond_bench test which I've been using as a quick
check to make sure the new atomics have no obvious breakage,
performance under qemu user-level emulation has roughly doubled for
most of the llsc archs due to better ability of gcc to inline (it
refuses to inline the largeish ll/sc loops written in asm but is happy
to inline the tiny a_ll and a_sc asm).

Please report any regressions, etc.

Rich


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-01-22  0:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-10 12:21 atomic.h cleanup Markus Wichmann
2016-01-10 16:57 ` Rich Felker
2016-01-10 17:35   ` Markus Wichmann
2016-01-10 17:50     ` Alexander Monakov
2016-01-11 16:35       ` Markus Wichmann
2016-01-11 17:12         ` Jens Gustedt
2016-01-11 19:03           ` Szabolcs Nagy
2016-01-11 20:56             ` Jens Gustedt
2016-01-14 22:12               ` Rich Felker
2016-01-14 22:37                 ` Jens Gustedt
2016-01-14 23:32                   ` Rich Felker
2016-01-15  0:46                     ` Szabolcs Nagy
2016-01-10 17:37   ` Markus Wichmann
2016-01-22  0:09   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).