mailing list of musl libc
* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
@ 2023-05-18  2:49 847567161
  2023-05-18 12:23 ` Re: [musl] Question:Why musl call a_barrier in __pthread_once? Szabolcs Nagy
  0 siblings, 1 reply; 7+ messages in thread
From: 847567161 @ 2023-05-18  2:49 UTC (permalink / raw)
  To: musl

Hi,

> There is an alternate algorithm for pthread_once that doesn't require
> a barrier in the common case, which I've considered implementing. But
> it does need efficient access to thread-local storage. At one time,
> this was a kinda bad assumption (especially legacy mips is horribly
> slow at TLS) but nowadays it's probably the right choice to make, and
> we should check that out again...

1、Can we move dmb after we get the value of control? like this:

int __pthread_once(pthread_once_t *control, void (*init)(void))
{
    /* Return immediately if init finished before, but ensure that
    * effects of the init routine are visible to the caller. */
    if (*(volatile int *)control == 2) {
        // a_barrier();
        return 0;
    }

    a_barrier();
    return __pthread_once_full(control, init);
}

2、Can we use 'ldar' to  instead of dmb here? I see musl
already use 'stlxr' in a_sc.  like this:

static inline int load(volatile int *p)
{
	int v;
	__asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
	return v;
}

if (load((volatile int *)control) == 2) {
    return 0;
}

...


Chuang Yin

* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18  2:49 Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once? 847567161
@ 2023-05-18 12:23 ` Szabolcs Nagy
  2023-05-18 13:29   ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Szabolcs Nagy @ 2023-05-18 12:23 UTC (permalink / raw)
  To: 847567161; +Cc: musl

* 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > There is an alternate algorithm for pthread_once that doesn't require
> > a barrier in the common case, which I've considered implementing. But
> > it does need efficient access to thread-local storage. At one time,
> > this was a kinda bad assumption (especially legacy mips is horribly
> > slow at TLS) but nowadays it's probably the right choice to make, and
> > we should check that out again...
> 
> 1、Can we move dmb after we get the value of control? like this:
> 
> int __pthread_once(pthread_once_t *control, void (*init)(void))
> {
>     /* Return immediately if init finished before, but ensure that
>     * effects of the init routine are visible to the caller. */
>     if (*(volatile int *)control == 2) {
>         // a_barrier();
>         return 0;
>     }

writes in init may not be visible when *control==2, without
the barrier. (there are many explanations on the web why
double-checked locking is wrong without an acquire barrier,
that's the same issue if you are interested in the details)
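
For readers who want the double-checked locking issue spelled out, here is a minimal sketch using C11 atomics rather than musl's internal a_* primitives. The names state, payload, lock and get_payload are made up for illustration; this is not musl's code.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical once-style state: 0 = not initialized, 2 = init finished. */
static atomic_int state;
static int payload;   /* plain, non-atomic data written by init */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void init(void)
{
    payload = 42;
}

/* Double-checked locking with the acquire/release pairing made explicit. */
int get_payload(void)
{
    /* Acquire load: pairs with the release store below, so a thread that
     * observes state == 2 also observes the write to payload. With a
     * relaxed load and no fence, that guarantee would be missing. */
    if (atomic_load_explicit(&state, memory_order_acquire) != 2) {
        pthread_mutex_lock(&lock);
        /* Re-check under the lock; the mutex already orders this load. */
        if (atomic_load_explicit(&state, memory_order_relaxed) != 2) {
            init();
            atomic_store_explicit(&state, 2, memory_order_release);
        }
        pthread_mutex_unlock(&lock);
    }
    return payload;
}
```

The fast path here is the analogue of the `*control == 2` check: dropping the acquire ordering (or the barrier after the load) is what allows a caller to see the flag set without seeing the init routine's writes.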

> 2、Can we use 'ldar' to  instead of dmb here? I see musl
> already use 'stlxr' in a_sc.  like this:
> 
> static inline int load(volatile int *p)
> {
> 	int v;
> 	__asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> 	return v;
> }
> 
> if (load((volatile int *)control) == 2) {
>     return 0;
> }

i think acquire ordering is enough because posix does not
require pthread_once to synchronize memory, but musl does
not have an acquire barrier/load, so it uses a_barrier.
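
As a rough illustration of what an acquire-only fast path could look like, assuming acquire ordering is in fact sufficient, here are two equivalent-in-intent forms in C11 atomics. The function names are hypothetical and musl itself uses a_barrier here, not these.

```c
#include <stdatomic.h>

/* Form 1: load-acquire. On AArch64, compilers typically emit a
 * load-acquire instruction such as ldar for this. */
int done_load_acquire(atomic_int *control)
{
    return atomic_load_explicit(control, memory_order_acquire) == 2;
}

/* Form 2: relaxed load, then an acquire fence only on the success path.
 * This is the "barrier after reading control" shape, with the fence kept
 * on the path that returns early rather than removed from it. */
int done_fence_acquire(atomic_int *control)
{
    if (atomic_load_explicit(control, memory_order_relaxed) != 2)
        return 0;
    atomic_thread_fence(memory_order_acquire);
    return 1;
}
```

Either form only orders the caller after the store that set the flag; neither is a full barrier.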

it is probably not worth optimizing the memory order since
we know there is an algorithm that does not need a barrier
in the common case.

* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18 12:23 ` Re: [musl] Question:Why musl call a_barrier in __pthread_once? Szabolcs Nagy
@ 2023-05-18 13:29   ` Rich Felker
  2023-05-18 14:01     ` Jₑₙₛ Gustedt
  2023-05-18 14:15     ` Jeffrey Walton
  0 siblings, 2 replies; 7+ messages in thread
From: Rich Felker @ 2023-05-18 13:29 UTC (permalink / raw)
  To: 847567161, musl

On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > There is an alternate algorithm for pthread_once that doesn't require
> > > a barrier in the common case, which I've considered implementing. But
> > > it does need efficient access to thread-local storage. At one time,
> > > this was a kinda bad assumption (especially legacy mips is horribly
> > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > we should check that out again...
> > 
> > 1、Can we move dmb after we get the value of control? like this:
> > 
> > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > {
> >     /* Return immediately if init finished before, but ensure that
> >     * effects of the init routine are visible to the caller. */
> >     if (*(volatile int *)control == 2) {
> >         // a_barrier();
> >         return 0;
> >     }
> 
> writes in init may not be visible when *control==2, without
> the barrier. (there are many explanations on the web why
> double-checked locking is wrong without an acquire barrier,
> that's the same issue if you are interested in the details)
> 
> > 2、Can we use 'ldar' to  instead of dmb here? I see musl
> > already use 'stlxr' in a_sc.  like this:
> > 
> > static inline int load(volatile int *p)
> > {
> > 	int v;
> > 	__asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> > 	return v;
> > }
> > 
> > if (load((volatile int *)control) == 2) {
> >     return 0;
> > }
> 
> i think acquire ordering is enough because posix does not
> require pthread_once to synchronize memory, but musl does
> not have an acquire barrier/load, so it uses a_barrier.

POSIX does require this. It's specified where Memory Synchronization
is defined,
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

    "The pthread_once() function shall synchronize memory for the
    first call in each thread for a given pthread_once_t object."

> it is probably not worth optimizing the memory order since
> we know there is an algorithm that does not need a barrier
> in the common case.

Arguably the above might make the barrier-free algorithm invalid for
pthread_once, but I'm not sure if the lack of "synchronize memory"
property in this case would be observable. It probably is with an
intentional construct trying to observe it. There may be some way to
salvage this with a second thread-local counter to account for
gratuitous extra synchronization needed.

Of course call_once is exempt from any such requirements (also exempt
from cancellation shenanigans) and is probably the optimal thing for
programs to use. If needed we can make call_once have a different,
more optimal implementation than pthread_once.
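
For reference, a minimal sketch of what using C11 call_once from <threads.h> looks like from the program's side. The names flag, load_config, get_config and config_init_runs are invented for this example, and it assumes a libc that ships <threads.h>.

```c
#include <threads.h>

/* Hypothetical lazily-initialized configuration value. */
static once_flag flag = ONCE_FLAG_INIT;
static int config_value;
static int init_runs;

static void load_config(void)
{
    config_value = 7;
    init_runs++;
}

/* Every caller goes through call_once; load_config runs exactly once. */
int get_config(void)
{
    call_once(&flag, load_config);
    return config_value;
}

/* Exposed only so the sketch can be checked. */
int config_init_runs(void)
{
    return init_runs;
}
```

Unlike pthread_once, call_once returns void and has no cancellation semantics to account for.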

We should probably also file an issue for POSIX to relax the
requirements on pthread_once here, if they're actually a hindrance to
doing this right.

Rich

* Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18 13:29   ` Rich Felker
@ 2023-05-18 14:01     ` Jₑₙₛ Gustedt
  2023-05-18 14:08       ` Rich Felker
  2023-05-18 14:15     ` Jeffrey Walton
  1 sibling, 1 reply; 7+ messages in thread
From: Jₑₙₛ Gustedt @ 2023-05-18 14:01 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, 847567161

Rich,

on Thu, 18 May 2023 09:29:05 -0400 you (Rich Felker <dalias@libc.org>)
wrote:

> Of course call_once is exempt from any such requirements

it was, but isn't anymore. In C23, now we have

    Completion of an effective call to the `call_once` function
    synchronizes with all subsequent calls to the `call_once` function
    with the same value of `flag`.

POSIX (for which the ISO 9945 instantiation is currently at NB ballot)
also has updated all of this

      The pthread_once() and call_once() functions shall synchronize
      memory for the first successful call in each thread for a given
      pthread_once_t or once_flag object, respectively. If the
      init_routine called by pthread_once() or call_once() is a
      cancellation point and is canceled, a successful call to
      pthread_once() for the same pthread_once_t object or to
      call_once() for the same once_flag object, made from a
      cancellation cleanup handler shall also synchronize memory.

If I understand this correctly, the C11 interfaces become mandatory if
the platform supports threads.


Thanks
Jₑₙₛ

-- 
:: ICube :::::::::::::::::::::::::::::: deputy director ::
:: Université de Strasbourg :::::::::::::::::::::: ICPS ::
:: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus ::
:: :::::::::::::::::::::::::::::::::::: ☎ +33 368854536 ::
:: https://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

* Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18 14:01     ` Jₑₙₛ Gustedt
@ 2023-05-18 14:08       ` Rich Felker
  0 siblings, 0 replies; 7+ messages in thread
From: Rich Felker @ 2023-05-18 14:08 UTC (permalink / raw)
  To: Jₑₙₛ Gustedt; +Cc: musl, 847567161

On Thu, May 18, 2023 at 04:01:18PM +0200, Jₑₙₛ Gustedt wrote:
> Rich,
> 
> on Thu, 18 May 2023 09:29:05 -0400 you (Rich Felker <dalias@libc.org>)
> wrote:
> 
> > Of course call_once is exempt from any such requirements
> 
> it was, but isn't anymore. In C23, now we have
> 
>     Completion of an effective call to the `call_once` function
>     synchronizes with all subsequent calls to the `call_once` function
>     with the same value of `flag`.

That's fine because it's for the same value of flag. The old POSIX
requirement is independent of the value of flag (just like
"synchronizes memory" for mutex lock is independent of which mutex
it's called with, etc.). This is why POSIX (as written) requires full
seq_cst for everything.

> POSIX (for which the ISO 9945 instantiation is currently at NB ballot)
> also has updated all of this
> 
>       The pthread_once() and call_once() functions shall synchronize
>       memory for the first successful call in each thread for a given
>       pthread_once_t or once_flag object, respectively. If the
>       init_routine called by pthread_once() or call_once() is a
>       cancellation point and is canceled, a successful call to
>       pthread_once() for the same pthread_once_t object or to
>       call_once() for the same once_flag object, made from a
>       cancellation cleanup handler shall also synchronize memory.
> 
> If I understand this correctly, the C11 interfaces become mandatory if
> the platform supports threads.

Looks like POSIX is fixing it then, by making it specific to the
object, just like C11. So all should be good.

Rich

* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18 13:29   ` Rich Felker
  2023-05-18 14:01     ` Jₑₙₛ Gustedt
@ 2023-05-18 14:15     ` Jeffrey Walton
  2023-05-18 14:20       ` Rich Felker
  1 sibling, 1 reply; 7+ messages in thread
From: Jeffrey Walton @ 2023-05-18 14:15 UTC (permalink / raw)
  To: musl; +Cc: 847567161

On Thu, May 18, 2023 at 9:29 AM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> > * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > > There is an alternate algorithm for pthread_once that doesn't require
> > > > a barrier in the common case, which I've considered implementing. But
> > > > it does need efficient access to thread-local storage. At one time,
> > > > this was a kinda bad assumption (especially legacy mips is horribly
> > > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > > we should check that out again...
> > >
> > > 1、Can we move dmb after we get the value of control? like this:
> > >
> > > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > > {
> > >     /* Return immediately if init finished before, but ensure that
> > >     * effects of the init routine are visible to the caller. */
> > >     if (*(volatile int *)control == 2) {
> > >         // a_barrier();
> > >         return 0;
> > >     }
> >
> > writes in init may not be visible when *control==2, without
> > the barrier. (there are many explanations on the web why
> > double-checked locking is wrong without an acquire barrier,
> > that's the same issue if you are interested in the details)
> >
> > > 2、Can we use 'ldar' to  instead of dmb here? I see musl
> > > already use 'stlxr' in a_sc.  like this:
> > >
> > > static inline int load(volatile int *p)
> > > {
> > >     int v;
> > >     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> > >     return v;
> > > }
> > >
> > > if (load((volatile int *)control) == 2) {
> > >     return 0;
> > > }
> >
> > i think acquire ordering is enough because posix does not
> > require pthread_once to synchronize memory, but musl does
> > not have an acquire barrier/load, so it uses a_barrier.
>
> POSIX does require this. It's specified where Memory Synchronization
> is defined,
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
>
>     "The pthread_once() function shall synchronize memory for the
>     first call in each thread for a given pthread_once_t object."
>
> > it is probably not worth optimizing the memory order since
> > we know there is an algorithm that does not need a barrier
> > in the common case.
>
> Arguably the above might make the barrier-free algorithm invalid for
> pthread_once, but I'm not sure if the lack of "synchronize memory"
> property in this case would be observable. It probably is with an
> intentional construct trying to observe it. There may be some way to
> salvage this with a second thread-local counter to account for
> gratuitous extra synchronization needed.
>
> Of course call_once is exempt from any such requirements (also exempt
> from cancellation shenanigans) and is probably the optimal thing for
> programs to use. If needed we can make call_once have a different,
> more optimal implementation than pthread_once.

Be careful of call_once.

Several years ago I cut over to C++11's call_once. The problem was, it
only worked reliably on 32-bit and 64-bit Intel platforms. It was a
disaster on Aarch64, PowerPC and Sparc. I had to back it out.

The problems happened back when GCC 6 and 7 were popular. The problem
was due to something sideways in glibc.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146

If you want a call_once-like initialization then rely on N2660:
Dynamic Initialization and Destruction with Concurrency.

Jeff

* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
  2023-05-18 14:15     ` Jeffrey Walton
@ 2023-05-18 14:20       ` Rich Felker
  0 siblings, 0 replies; 7+ messages in thread
From: Rich Felker @ 2023-05-18 14:20 UTC (permalink / raw)
  To: Jeffrey Walton; +Cc: musl, 847567161

On Thu, May 18, 2023 at 10:15:36AM -0400, Jeffrey Walton wrote:
> On Thu, May 18, 2023 at 9:29 AM Rich Felker <dalias@libc.org> wrote:
> >
> > On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> > > * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > > > There is an alternate algorithm for pthread_once that doesn't require
> > > > > a barrier in the common case, which I've considered implementing. But
> > > > > it does need efficient access to thread-local storage. At one time,
> > > > > this was a kinda bad assumption (especially legacy mips is horribly
> > > > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > > > we should check that out again...
> > > >
> > > > 1、Can we move dmb after we get the value of control? like this:
> > > >
> > > > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > > > {
> > > >     /* Return immediately if init finished before, but ensure that
> > > >     * effects of the init routine are visible to the caller. */
> > > >     if (*(volatile int *)control == 2) {
> > > >         // a_barrier();
> > > >         return 0;
> > > >     }
> > >
> > > writes in init may not be visible when *control==2, without
> > > the barrier. (there are many explanations on the web why
> > > double-checked locking is wrong without an acquire barrier,
> > > that's the same issue if you are interested in the details)
> > >
> > > > 2、Can we use 'ldar' to  instead of dmb here? I see musl
> > > > already use 'stlxr' in a_sc.  like this:
> > > >
> > > > static inline int load(volatile int *p)
> > > > {
> > > >     int v;
> > > >     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> > > >     return v;
> > > > }
> > > >
> > > > if (load((volatile int *)control) == 2) {
> > > >     return 0;
> > > > }
> > >
> > > i think acquire ordering is enough because posix does not
> > > require pthread_once to synchronize memory, but musl does
> > > not have an acquire barrier/load, so it uses a_barrier.
> >
> > POSIX does require this. It's specified where Memory Synchronization
> > is defined,
> > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
> >
> >     "The pthread_once() function shall synchronize memory for the
> >     first call in each thread for a given pthread_once_t object."
> >
> > > it is probably not worth optimizing the memory order since
> > > we know there is an algorithm that does not need a barrier
> > > in the common case.
> >
> > Arguably the above might make the barrier-free algorithm invalid for
> > pthread_once, but I'm not sure if the lack of "synchronize memory"
> > property in this case would be observable. It probably is with an
> > intentional construct trying to observe it. There may be some way to
> > salvage this with a second thread-local counter to account for
> > gratuitous extra synchronization needed.
> >
> > Of course call_once is exempt from any such requirements (also exempt
> > from cancellation shenanigans) and is probably the optimal thing for
> > programs to use. If needed we can make call_once have a different,
> > more optimal implementation than pthread_once.
> 
> Be careful of call_once.
> 
> Several years ago I cut over to C++11's call_once. The problem was, it
> only worked reliably on 32-bit and 64-bit Intel platforms. It was a
> disaster on Aarch64, PowerPC and Sparc. I had to back it out.

That is about the C++ std::call_once, implemented by GNU libstdc++,
and is unrelated to the C11 call_once, which libc implements.

> The problems happened back when GCC 6 and 7 were popular. The problem
> was due to something sideways in glibc.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
> 
> If you want a call_once-like initialization then rely on N2660:
> Dynamic Initialization and Destruction with Concurrency.

That's the general algorithm we've been talking about (though without
bad properties like gratuitously inlining it to lock-in implementation
details as ABI).
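
One way the thread-local, barrier-free fast path can be sketched is with a global epoch counter and a per-thread copy of it. Everything below (epoch_once_t, epoch_once, the demo helpers) is a hypothetical illustration under simplifying assumptions, not musl's code and not a proposed ABI: init runs with the lock held, so it must not call epoch_once recursively, and cancellation is ignored.

```c
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical names throughout. */
typedef atomic_ullong epoch_once_t;
#define EPOCH_ONCE_INIT ULLONG_MAX   /* "not yet initialized" */

static pthread_mutex_t epoch_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long global_epoch;            /* protected by epoch_lock */
static _Thread_local unsigned long long my_epoch;  /* epoch this thread last synchronized at */

static void epoch_once_slow(epoch_once_t *once, void (*init)(void))
{
    pthread_mutex_lock(&epoch_lock);
    if (atomic_load_explicit(once, memory_order_relaxed) == EPOCH_ONCE_INIT) {
        init();
        /* Stamp the object with a fresh epoch once init has finished. */
        atomic_store_explicit(once, ++global_epoch, memory_order_relaxed);
    }
    /* Taking the mutex synchronized this thread with every init that has
     * completed so far; record that fact in thread-local storage. */
    my_epoch = global_epoch;
    pthread_mutex_unlock(&epoch_lock);
}

void epoch_once(epoch_once_t *once, void (*init)(void))
{
    /* Fast path: one relaxed load and a TLS compare, no barrier. The
     * object's epoch is <= my_epoch only if this thread took the lock
     * after that object's init completed. */
    if (atomic_load_explicit(once, memory_order_relaxed) > my_epoch)
        epoch_once_slow(once, init);
}

/* Small demo: two threads, one object, init should run once. */
static epoch_once_t demo_flag = EPOCH_ONCE_INIT;
static int demo_calls;

static void demo_init(void)
{
    demo_calls++;
}

static void *demo_thread(void *arg)
{
    (void)arg;
    epoch_once(&demo_flag, demo_init);
    return 0;
}

int demo_calls_after_two_threads(void)
{
    pthread_t t;
    epoch_once(&demo_flag, demo_init);
    if (pthread_create(&t, 0, demo_thread, 0) == 0)
        pthread_join(t, 0);
    epoch_once(&demo_flag, demo_init);
    return demo_calls;
}
```

The cost moves from a barrier on every call to a thread-local access on every call, which is why efficient TLS matters for this approach. Note that the synchronization it provides is tied to initialization of the objects involved, not a full "synchronize memory" on each first call.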

Rich
