* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: 847567161 @ 2023-05-18 2:49 UTC
To: musl

Hi,

> There is an alternate algorithm for pthread_once that doesn't require
> a barrier in the common case, which I've considered implementing. But
> it does need efficient access to thread-local storage. At one time,
> this was a kinda bad assumption (especially legacy mips is horribly
> slow at TLS) but nowadays it's probably the right choice to make, and
> we should check that out again...

1、Can we move dmb after we get the value of control? like this:

int __pthread_once(pthread_once_t *control, void (*init)(void))
{
    /* Return immediately if init finished before, but ensure that
     * effects of the init routine are visible to the caller. */
    if (*(volatile int *)control == 2) {
        // a_barrier();
        return 0;
    }
    a_barrier();
    return __pthread_once_full(control, init);
}

2、Can we use 'ldar' to instead of dmb here? I see musl
already use 'stlxr' in a_sc. like this:

static inline int load(volatile int *p)
{
    int v;
    __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
    return v;
}

if (load((volatile int *)control) == 2) {
    return 0;
}
...

Chuang Yin
* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Szabolcs Nagy @ 2023-05-18 12:23 UTC
To: 847567161; +Cc: musl

* 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > There is an alternate algorithm for pthread_once that doesn't require
> > a barrier in the common case, which I've considered implementing. But
> > it does need efficient access to thread-local storage. At one time,
> > this was a kinda bad assumption (especially legacy mips is horribly
> > slow at TLS) but nowadays it's probably the right choice to make, and
> > we should check that out again...
>
> 1、Can we move dmb after we get the value of control? like this:
>
> int __pthread_once(pthread_once_t *control, void (*init)(void))
> {
>     /* Return immediately if init finished before, but ensure that
>      * effects of the init routine are visible to the caller. */
>     if (*(volatile int *)control == 2) {
>         // a_barrier();
>         return 0;
>     }

writes in init may not be visible when *control==2, without
the barrier. (there are many explanations on the web why
double-checked locking is wrong without an acquire barrier,
that's the same issue if you are interested in the details)

> 2、Can we use 'ldar' to instead of dmb here? I see musl
> already use 'stlxr' in a_sc. like this:
>
> static inline int load(volatile int *p)
> {
>     int v;
>     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
>     return v;
> }
>
> if (load((volatile int *)control) == 2) {
>     return 0;
> }

i think acquire ordering is enough because posix does not
require pthread_once to synchronize memory, but musl does
not have an acquire barrier/load, so it uses a_barrier.

it is probably not worth optimizing the memory order since
we know there is an algorithm that does not need a barrier
in the common case.
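To make the double-checked locking point concrete, here is a minimal sketch in portable C11 atomics rather than musl's internal `a_barrier()`. The names `state`, `payload`, `init_payload` and `get_payload` are hypothetical and only the value 2 for "done" is borrowed from the thread; this is not musl's implementation. The acquire load on the fast path pairs with the release store made after initialization, which is the ordering the reply says is needed before the initialized data may be read.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical names for illustration; this is not musl's code. */
static atomic_int state;      /* 0 = not initialized, 2 = done */
static int payload;           /* plain data written by the init routine */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void init_payload(void)
{
    payload = 42;
}

int get_payload(void)
{
    /* Fast path: the acquire load pairs with the release store below,
     * so a thread that reads 2 also observes the write to payload.
     * With a relaxed or plain load here, that would not be guaranteed. */
    if (atomic_load_explicit(&state, memory_order_acquire) != 2) {
        pthread_mutex_lock(&lock);
        /* Re-check under the lock; the mutex already orders this load. */
        if (atomic_load_explicit(&state, memory_order_relaxed) != 2) {
            init_payload();
            atomic_store_explicit(&state, 2, memory_order_release);
        }
        pthread_mutex_unlock(&lock);
    }
    return payload;
}
```

On AArch64, compilers typically lower an acquire load like this to an `ldar`-family instruction instead of a separate `dmb`, which is close to what the second question proposes by hand. Whether acquire ordering is sufficient for `pthread_once` itself depends on the POSIX wording discussed in the rest of the thread.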
* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Rich Felker @ 2023-05-18 13:29 UTC
To: 847567161, musl

On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > There is an alternate algorithm for pthread_once that doesn't require
> > > a barrier in the common case, which I've considered implementing. But
> > > it does need efficient access to thread-local storage. At one time,
> > > this was a kinda bad assumption (especially legacy mips is horribly
> > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > we should check that out again...
> >
> > 1、Can we move dmb after we get the value of control? like this:
> >
> > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > {
> >     /* Return immediately if init finished before, but ensure that
> >      * effects of the init routine are visible to the caller. */
> >     if (*(volatile int *)control == 2) {
> >         // a_barrier();
> >         return 0;
> >     }
>
> writes in init may not be visible when *control==2, without
> the barrier. (there are many explanations on the web why
> double-checked locking is wrong without an acquire barrier,
> that's the same issue if you are interested in the details)
>
> > 2、Can we use 'ldar' to instead of dmb here? I see musl
> > already use 'stlxr' in a_sc. like this:
> >
> > static inline int load(volatile int *p)
> > {
> >     int v;
> >     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> >     return v;
> > }
> >
> > if (load((volatile int *)control) == 2) {
> >     return 0;
> > }
>
> i think acquire ordering is enough because posix does not
> require pthread_once to synchronize memory, but musl does
> not have an acquire barrier/load, so it uses a_barrier.

POSIX does require this. It's specified where Memory Synchronization
is defined,
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

    "The pthread_once() function shall synchronize memory for the
    first call in each thread for a given pthread_once_t object."

> it is probably not worth optimizing the memory order since
> we know there is an algorithm that does not need a barrier
> in the common case.

Arguably the above might make the barrier-free algorithm invalid for
pthread_once, but I'm not sure if the lack of "synchronize memory"
property in this case would be observable. It probably is with an
intentional construct trying to observe it. There may be some way to
salvage this with a second thread-local counter to account for
gratuitous extra synchronization needed.

Of course call_once is exempt from any such requirements (also exempt
from cancellation shenanigans) and is probably the optimal thing for
programs to use. If needed we can make call_once have a different,
more optimal implementation than pthread_once.

We should probably also file an issue for POSIX to relax the
requirements on pthread_once here, if they're actually a hindrance to
doing this right.

Rich
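For readers less familiar with the caller-side contract being debated, the following is a small sketch of what a program relies on from `pthread_once`: the init routine runs exactly once, and every thread that returns from `pthread_once` sees its writes. The names `init_config`, `config_value`, `worker` and `run_once_demo` are made up for illustration. It does not try to observe the stronger "synchronize memory" property quoted from POSIX, which concerns unrelated memory as well.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical demo names; only pthread_once and PTHREAD_ONCE_INIT are real API. */
static pthread_once_t once = PTHREAD_ONCE_INIT;
static int init_calls;      /* how many times the init routine ran */
static int config_value;    /* data published by the init routine */

static void init_config(void)
{
    init_calls++;
    config_value = 42;
}

static void *worker(void *arg)
{
    /* Every thread calls pthread_once; only one runs init_config. */
    pthread_once(&once, init_config);
    *(int *)arg = config_value;   /* must observe the initialized value */
    return NULL;
}

/* Returns the number of times init ran, or -1 if a thread could not be
 * created or saw an uninitialized value. */
int run_once_demo(void)
{
    enum { N = 4 };
    pthread_t t[N];
    int seen[N];
    int started = 0, ok = 1;

    for (; started < N; started++) {
        if (pthread_create(&t[started], NULL, worker, &seen[started]) != 0) {
            ok = 0;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
        if (seen[i] != 42)
            ok = 0;
    }
    return ok ? init_calls : -1;
}
```

The barrier on the fast path of `__pthread_once` is what makes the read of `config_value` in `worker` safe on weakly ordered hardware when a thread finds the control word already set.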
* Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Jₑₙₛ Gustedt @ 2023-05-18 14:01 UTC
To: Rich Felker; +Cc: musl, 847567161

Rich,

on Thu, 18 May 2023 09:29:05 -0400 you (Rich Felker <dalias@libc.org>)
wrote:

> Of course call_once is exempt from any such requirements

it was, but isn't anymore. In C23, now we have

    Completion of an effective call to the `call_once` function
    synchronizes with all subsequent calls to the `call_once` function
    with the same value of `flag`.

POSIX (for which the ISO 9945 instantiation is currently at NB ballot)
also has updated all of this

    The pthread_once() and call_once() functions shall synchronize
    memory for the first successful call in each thread for a given
    pthread_once_t or once_flag object, respectively. If the
    init_routine called by pthread_once() or call_once() is a
    cancellation point and is canceled, a successful call to
    pthread_once() for the same pthread_once_t object or to
    call_once() for the same once_flag object, made from a
    cancellation cleanup handler shall also synchronize memory.

If I understand this correctly, the C11 interfaces become mandatory if
the platform supports threads.

Thanks
Jₑₙₛ

-- 
:: ICube :::::::::::::::::::::::::::::: deputy director ::
:: Université de Strasbourg :::::::::::::::::::::: ICPS ::
:: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus ::
:: :::::::::::::::::::::::::::::::::::: ☎ +33 368854536 ::
:: https://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
* Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Rich Felker @ 2023-05-18 14:08 UTC
To: Jₑₙₛ Gustedt; +Cc: musl, 847567161

On Thu, May 18, 2023 at 04:01:18PM +0200, Jₑₙₛ Gustedt wrote:
> Rich,
>
> on Thu, 18 May 2023 09:29:05 -0400 you (Rich Felker <dalias@libc.org>)
> wrote:
>
> > Of course call_once is exempt from any such requirements
>
> it was, but isn't anymore. In C23, now we have
>
>     Completion of an effective call to the `call_once` function
>     synchronizes with all subsequent calls to the `call_once` function
>     with the same value of `flag`.

That's fine because it's for the same value of flag. The old POSIX
requirement is independent of the value of flag (just like
"synchronizes memory" for mutex lock is independent of which mutex
it's called with, etc.). This is why POSIX (as written) requires full
seq_cst for everything.

> POSIX (for which the ISO 9945 instantiation is currently at NB ballot)
> also has updated all of this
>
>     The pthread_once() and call_once() functions shall synchronize
>     memory for the first successful call in each thread for a given
>     pthread_once_t or once_flag object, respectively. If the
>     init_routine called by pthread_once() or call_once() is a
>     cancellation point and is canceled, a successful call to
>     pthread_once() for the same pthread_once_t object or to
>     call_once() for the same once_flag object, made from a
>     cancellation cleanup handler shall also synchronize memory.
>
> If I understand this correctly, the C11 interfaces become mandatory if
> the platform supports threads.

Looks like POSIX is fixing it then, by making it specific to the
object, just like C11. So all should be good.

Rich
* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Jeffrey Walton @ 2023-05-18 14:15 UTC
To: musl; +Cc: 847567161

On Thu, May 18, 2023 at 9:29 AM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> > * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > > There is an alternate algorithm for pthread_once that doesn't require
> > > > a barrier in the common case, which I've considered implementing. But
> > > > it does need efficient access to thread-local storage. At one time,
> > > > this was a kinda bad assumption (especially legacy mips is horribly
> > > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > > we should check that out again...
> > >
> > > 1、Can we move dmb after we get the value of control? like this:
> > >
> > > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > > {
> > >     /* Return immediately if init finished before, but ensure that
> > >      * effects of the init routine are visible to the caller. */
> > >     if (*(volatile int *)control == 2) {
> > >         // a_barrier();
> > >         return 0;
> > >     }
> >
> > writes in init may not be visible when *control==2, without
> > the barrier. (there are many explanations on the web why
> > double-checked locking is wrong without an acquire barrier,
> > that's the same issue if you are interested in the details)
> >
> > > 2、Can we use 'ldar' to instead of dmb here? I see musl
> > > already use 'stlxr' in a_sc. like this:
> > >
> > > static inline int load(volatile int *p)
> > > {
> > >     int v;
> > >     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> > >     return v;
> > > }
> > >
> > > if (load((volatile int *)control) == 2) {
> > >     return 0;
> > > }
> >
> > i think acquire ordering is enough because posix does not
> > require pthread_once to synchronize memory, but musl does
> > not have an acquire barrier/load, so it uses a_barrier.
>
> POSIX does require this. It's specified where Memory Synchronization
> is defined,
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
>
>     "The pthread_once() function shall synchronize memory for the
>     first call in each thread for a given pthread_once_t object."
>
> > it is probably not worth optimizing the memory order since
> > we know there is an algorithm that does not need a barrier
> > in the common case.
>
> Arguably the above might make the barrier-free algorithm invalid for
> pthread_once, but I'm not sure if the lack of "synchronize memory"
> property in this case would be observable. It probably is with an
> intentional construct trying to observe it. There may be some way to
> salvage this with a second thread-local counter to account for
> gratuitous extra synchronization needed.
>
> Of course call_once is exempt from any such requirements (also exempt
> from cancellation shenanigans) and is probably the optimal thing for
> programs to use. If needed we can make call_once have a different,
> more optimal implementation than pthread_once.

Be careful of call_once.

Several years ago I cut over to C++11's call_once. The problem was, it
only worked reliably on 32-bit and 64-bit Intel platforms. It was a
disaster on Aarch64, PowerPC and Sparc. I had to back it out.

The problems happened back when GCC 6 and 7 were popular. The problem
was due to something sideways in glibc.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146

If you want a call_once-like initialization then rely on N2660:
Dynamic Initialization and Destruction with Concurrency.

Jeff
* Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once?
From: Rich Felker @ 2023-05-18 14:20 UTC
To: Jeffrey Walton; +Cc: musl, 847567161

On Thu, May 18, 2023 at 10:15:36AM -0400, Jeffrey Walton wrote:
> On Thu, May 18, 2023 at 9:29 AM Rich Felker <dalias@libc.org> wrote:
> >
> > On Thu, May 18, 2023 at 02:23:06PM +0200, Szabolcs Nagy wrote:
> > > * 847567161 <847567161@qq.com> [2023-05-18 10:49:44 +0800]:
> > > > > There is an alternate algorithm for pthread_once that doesn't require
> > > > > a barrier in the common case, which I've considered implementing. But
> > > > > it does need efficient access to thread-local storage. At one time,
> > > > > this was a kinda bad assumption (especially legacy mips is horribly
> > > > > slow at TLS) but nowadays it's probably the right choice to make, and
> > > > > we should check that out again...
> > > >
> > > > 1、Can we move dmb after we get the value of control? like this:
> > > >
> > > > int __pthread_once(pthread_once_t *control, void (*init)(void))
> > > > {
> > > >     /* Return immediately if init finished before, but ensure that
> > > >      * effects of the init routine are visible to the caller. */
> > > >     if (*(volatile int *)control == 2) {
> > > >         // a_barrier();
> > > >         return 0;
> > > >     }
> > >
> > > writes in init may not be visible when *control==2, without
> > > the barrier. (there are many explanations on the web why
> > > double-checked locking is wrong without an acquire barrier,
> > > that's the same issue if you are interested in the details)
> > >
> > > > 2、Can we use 'ldar' to instead of dmb here? I see musl
> > > > already use 'stlxr' in a_sc. like this:
> > > >
> > > > static inline int load(volatile int *p)
> > > > {
> > > >     int v;
> > > >     __asm__ __volatile__ ("ldar %w0,%1" : "=r"(v) : "Q"(*p));
> > > >     return v;
> > > > }
> > > >
> > > > if (load((volatile int *)control) == 2) {
> > > >     return 0;
> > > > }
> > >
> > > i think acquire ordering is enough because posix does not
> > > require pthread_once to synchronize memory, but musl does
> > > not have an acquire barrier/load, so it uses a_barrier.
> >
> > POSIX does require this. It's specified where Memory Synchronization
> > is defined,
> > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
> >
> >     "The pthread_once() function shall synchronize memory for the
> >     first call in each thread for a given pthread_once_t object."
> >
> > > it is probably not worth optimizing the memory order since
> > > we know there is an algorithm that does not need a barrier
> > > in the common case.
> >
> > Arguably the above might make the barrier-free algorithm invalid for
> > pthread_once, but I'm not sure if the lack of "synchronize memory"
> > property in this case would be observable. It probably is with an
> > intentional construct trying to observe it. There may be some way to
> > salvage this with a second thread-local counter to account for
> > gratuitous extra synchronization needed.
> >
> > Of course call_once is exempt from any such requirements (also exempt
> > from cancellation shenanigans) and is probably the optimal thing for
> > programs to use. If needed we can make call_once have a different,
> > more optimal implementation than pthread_once.
>
> Be careful of call_once.
>
> Several years ago I cut over to C++11's call_once. The problem was, it
> only worked reliably on 32-bit and 64-bit Intel platforms. It was a
> disaster on Aarch64, PowerPC and Sparc. I had to back it out.

That is about the C++ std::call_once, implemented by GNU libstdc++,
and is unrelated to the C11 call_once, which libc implements.

> The problems happened back when GCC 6 and 7 were popular. The problem
> was due to something sideways in glibc.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
>
> If you want a call_once-like initialization then rely on N2660:
> Dynamic Initialization and Destruction with Concurrency.

That's the general algorithm we've been talking about (though without
bad properties like gratuitously inlining it to lock-in implementation
details as ABI).

Rich
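The thread never spells out the barrier-free algorithm, so the following is only a rough sketch of the general thread-local-epoch idea, assuming a design where each once object stores the global epoch at which its initialization finished and each thread remembers the last epoch it synchronized with. All names (`epoch_once_t`, `epoch_once`, `epoch_once_slow`, `ONCE_UNINIT`, `ONCE_RUNNING`, the `demo_*` helpers) are hypothetical, and details may differ from both N2660 and anything musl would adopt. It ignores epoch overflow and cancellation of the init routine.

```c
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical sketch; not musl's or any libc's implementation. */
#define ONCE_UNINIT  INT_MAX          /* initializer for a once object */
#define ONCE_RUNNING (INT_MAX - 1)    /* init routine currently running */

typedef atomic_int epoch_once_t;

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int global_epoch;                /* protected by mu */
static _Thread_local int thread_epoch;  /* last epoch this thread synchronized with */

static void epoch_once_slow(epoch_once_t *once, void (*init)(void))
{
    pthread_mutex_lock(&mu);
    if (atomic_load_explicit(once, memory_order_relaxed) == ONCE_UNINIT) {
        atomic_store_explicit(once, ONCE_RUNNING, memory_order_relaxed);
        pthread_mutex_unlock(&mu);      /* run init without holding the lock */
        init();
        pthread_mutex_lock(&mu);
        /* Publish completion as a fresh epoch number. */
        atomic_store_explicit(once, ++global_epoch, memory_order_relaxed);
        pthread_cond_broadcast(&cv);
    } else {
        /* Another thread is initializing, or already finished. */
        while (atomic_load_explicit(once, memory_order_relaxed) == ONCE_RUNNING)
            pthread_cond_wait(&cv, &mu);
    }
    /* Taking mu ordered this thread after every init completed so far. */
    thread_epoch = global_epoch;
    pthread_mutex_unlock(&mu);
}

static void epoch_once(epoch_once_t *once, void (*init)(void))
{
    /* Fast path: one relaxed load and one thread-local read, no barrier.
     * If the stored epoch is not newer than the one this thread already
     * synchronized with, the init's writes are already visible here. */
    if (atomic_load_explicit(once, memory_order_relaxed) > thread_epoch)
        epoch_once_slow(once, init);
}

/* Tiny demo state so the sketch can be exercised. */
static epoch_once_t demo_once = ONCE_UNINIT;
static int demo_calls;
static void demo_init(void) { demo_calls++; }
```

The cost moves from a barrier on every call to a thread-local read, which is why the quoted text ties the idea to efficient TLS access. Note that the fast path synchronizes nothing by itself, which is the property Rich suspects may conflict with the older POSIX wording.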
Thread overview: 7+ messages
2023-05-18  2:49 Re: Re: [musl] Question:Why musl call a_barrier in __pthread_once? 847567161
2023-05-18 12:23 ` Szabolcs Nagy
2023-05-18 13:29   ` Rich Felker
2023-05-18 14:01     ` Jₑₙₛ Gustedt
2023-05-18 14:08       ` Rich Felker
2023-05-18 14:15     ` Jeffrey Walton
2023-05-18 14:20       ` Rich Felker