* SH runtime switchable atomics - proposed design
From: Rich Felker @ 2016-01-19 20:28 UTC
To: gcc; +Cc: Oleg Endo, musl

I've been working on the new version of runtime-selected SH atomics
for musl, and I think what I've got might be appropriate for GCC's
generated atomics too. I know Oleg was not very excited about doing
this on the gcc side from a cost/benefit perspective, but I think my
approach is actually preferable over inline atomics from a code size
perspective. It uses a single "cas" function with an "SFUNC" type ABI
(not standard calling convention) with the following constraints:

Inputs:
- R0: Memory address to operate on
- R1: Address of implementation function, loaded from a global
- R2: Comparison value
- R3: Value to set on success

Outputs:
- R3: Old value read, ==R2 iff cas succeeded.

Preserved: R0, R2.

Clobbered: R1, PR, T.

This call (performed from __asm__ for musl, but gcc would do it as SH
"SFUNC") is highly compact/convenient for inlining because it avoids
clobbering any of the argument registers that are likely to already be
in use by the caller, and it preserves the important values that are
likely to be reused after the cas operation.

For J2 and future J4, the function pointer just points to:

        rts
        cas.l r2,r3,@r0

and the only costs vs an inline cas.l are loading the address of the
function (done in the caller; involves GOT access) and clobbering R1
and PR.

This is still a draft design and the version in musl is subject to
change at any time since it's not a public API/ABI, but I think it
could turn into something useful to have on the gcc side with a
-matomic-model=libfunc option or similar. Other ABI considerations for
gcc use would be where to store the function pointer and how to
initialize it. To be reasonably efficient with FDPIC the caller needs
to be responsible for loading the function pointer (and it needs to
always point to code, not a function descriptor) so that the callee
does not need a GOT pointer passed in.

Rich
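
The calling pattern described in the message above can be approximated in C. This is only a rough sketch: the real call uses fixed registers (R0 to R3) and a non-standard convention that portable C cannot express, so an ordinary function pointer stands in for it. The names `cas_fn`, `cas_impl`, `cas_builtin` and `cas_succeeded` are hypothetical and do not come from musl or GCC; the only real API used is GCC's `__sync_val_compare_and_swap` builtin.

```c
/* Hypothetical C model of the proposed contract: the callee returns the
 * old value, and the swap happened iff that value equals `expected`. */
typedef int (*cas_fn)(volatile int *addr, int expected, int desired);

/* One possible backend, built on GCC's __sync builtin. A real SH build
 * would instead point at one of the assembly routines. */
static int cas_builtin(volatile int *addr, int expected, int desired)
{
    return __sync_val_compare_and_swap(addr, expected, desired);
}

/* Global holding the selected implementation (assumed name). In the
 * proposal the caller loads this pointer itself, so the callee never
 * needs a GOT pointer. */
static cas_fn cas_impl = cas_builtin;

/* Caller side: load the pointer, call through it, then test
 * old == expected instead of relying on a flag bit. */
static int cas_succeeded(volatile int *addr, int expected, int desired)
{
    cas_fn f = cas_impl;                 /* "R1": implementation address */
    int old = f(addr, expected, desired);
    return old == expected;              /* "R3 == R2" iff success */
}
```

With a word holding 5, `cas_succeeded(&word, 5, 9)` reports success and leaves 9 in the word; repeating the call with the stale expected value 5 reports failure and leaves the word unchanged.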
* Re: SH runtime switchable atomics - proposed design
From: Rich Felker @ 2016-01-19 20:51 UTC
To: gcc; +Cc: Oleg Endo, musl

On Tue, Jan 19, 2016 at 03:28:52PM -0500, Rich Felker wrote:
> I've been working on the new version of runtime-selected SH atomics
> for musl, and I think what I've got might be appropriate for GCC's
> generated atomics too. I know Oleg was not very excited about doing
> this on the gcc side from a cost/benefit perspective, but I think my
> approach is actually preferable over inline atomics from a code size
> perspective. It uses a single "cas" function with an "SFUNC" type ABI
> (not standard calling convention) with the following constraints:
>
> Inputs:
> - R0: Memory address to operate on
> - R1: Address of implementation function, loaded from a global
> - R2: Comparison value
> - R3: Value to set on success
>
> Outputs:
> - R3: Old value read, ==R2 iff cas succeeded.
>
> Preserved: R0, R2.
>
> Clobbered: R1, PR, T.
>
> This call (performed from __asm__ for musl, but gcc would do it as SH
> "SFUNC") is highly compact/convenient for inlining because it avoids
> clobbering any of the argument registers that are likely to already be
> in use by the caller, and it preserves the important values that are
> likely to be reused after the cas operation.
>
> For J2 and future J4, the function pointer just points to:
>
>         rts
>         cas.l r2,r3,@r0
>
> and the only costs vs an inline cas.l are loading the address of the
> function (done in the caller; involves GOT access) and clobbering R1
> and PR.
>
> This is still a draft design and the version in musl is subject to
> change at any time since it's not a public API/ABI, but I think it
> could turn into something useful to have on the gcc side with a
> -matomic-model=libfunc option or similar. Other ABI considerations for
> gcc use would be where to store the function pointer and how to
> initialize it. To be reasonably efficient with FDPIC the caller needs
> to be responsible for loading the function pointer (and it needs to
> always point to code, not a function descriptor) so that the callee
> does not need a GOT pointer passed in.

Attached is my current draft of the implementations of the cas 'sfunc'
for musl. Forgot to include it before.

Rich

[-- Attachment #2: sh.s --]

/* Contract for all versions is same as cas.l r2,r3,@r0
 * pr and r1 are also clobbered (by jsr & r1 as temp).
 * r0,r2,r4-r15 must be preserved.
 * r3 contains result (==r2 iff cas succeeded). */

        .align 2
__sh_cas_gusa:
        mov.l r5,@-r15
        mov.l r4,@-r15
        mov r0,r4               /* r4 = address; r0 is needed by gUSA */
        mova 1f,r0              /* r0 = end of the atomic sequence */
        mov r15,r1              /* r1 = saved stack pointer */
        mov #(0f-1f),r15        /* r15 = negative sequence length */
0:      mov.l @r4,r5
        cmp/eq r5,r2
        bf 1f
        mov.l r3,@r4
1:      mov r1,r15              /* leave the atomic sequence */
        mov r5,r3
        mov r4,r0
        mov.l @r15+,r4
        rts
        mov.l @r15+,r5          /* delay slot */

__sh_cas_llsc:
        mov r0,r1
        synco
0:      movli.l @r1,r0
        cmp/eq r0,r2
        bf 1f
        mov r3,r0
        movco.l r0,@r1
        bf 0b                   /* store-conditional failed: retry */
        mov r2,r0
1:      synco
        mov r0,r3
        rts
        mov r1,r0               /* delay slot */

__sh_cas_imask:
        mov r0,r1
        stc sr,r0
        mov.l r0,@-r15          /* save SR */
        or #0xf0,r0             /* mask all interrupts */
        ldc r0,sr
        mov.l @r1,r0
        cmp/eq r0,r2
        bf 1f
        mov.l r3,@r1
1:      ldc.l @r15+,sr          /* restore SR */
        mov r0,r3
        rts
        mov r1,r0               /* delay slot */

__sh_cas_cas_l:
        rts
        cas.l r2,r3,@r0         /* delay slot */
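
All four attached variants compute the same thing and differ only in how they make the read-compare-store sequence atomic (gUSA rollback, movli.l/movco.l retry, interrupt masking, or the cas.l instruction). A plain C reference model of that shared core is sketched below. It is deliberately not atomic, and the name `cas_reference` is hypothetical; the comments map each step to the corresponding instruction in the `__sh_cas_imask` variant.

```c
/* Non-atomic reference model of the shared contract. Each assembly
 * variant wraps exactly this logic in its own atomicity mechanism. */
static int cas_reference(int *addr, int expected, int desired)
{
    int old = *addr;        /* mov.l @r1,r0   : read the current value  */
    if (old == expected)    /* cmp/eq r0,r2 ; bf 1f : skip on mismatch  */
        *addr = desired;    /* mov.l r3,@r1   : store the new value     */
    return old;             /* mov r0,r3      : result == expected iff  */
                            /*                  the store happened      */
}
```

A caller that sees the returned value equal to `expected` knows the store took place; any other return value is the conflicting content that was found in memory.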
* Re: SH runtime switchable atomics - proposed design
From: Oleg Endo @ 2016-01-20 23:08 UTC
To: Rich Felker, gcc; +Cc: musl

On Tue, 2016-01-19 at 15:28 -0500, Rich Felker wrote:
> I've been working on the new version of runtime-selected SH atomics
> for musl, and I think what I've got might be appropriate for GCC's
> generated atomics too. I know Oleg was not very excited about doing
> this on the gcc side from a cost/benefit perspective

I am just not keen on making this the default atomic model for SH.
If you have a system built around this atomic model and want to add it
to GCC, please send in patches. Just a few comments below...

> Inputs:
> - R0: Memory address to operate on
> - R1: Address of implementation function, loaded from a global
> - R2: Comparison value
> - R3: Value to set on success
>
> Outputs:
> - R3: Old value read, ==R2 iff cas succeeded.
>
> Preserved: R0, R2.
>
> Clobbered: R1, PR, T.

The T bit is obviously the result of the cas operation. So you could
use it as an output directly instead of the implicit R3 == R2
condition.

>
> This call (performed from __asm__ for musl, but gcc would do it as SH
> "SFUNC") is highly compact/convenient for inlining because it avoids
> clobbering any of the argument registers that are likely to already
> be
> in use by the caller, and it preserves the important values that are
> likely to be reused after the cas operation.
>
> For J2 and future J4, the function pointer just points to:
>
>         rts
>         cas.l r2,r3,@r0
>
> and the only costs vs an inline cas.l are loading the address of the
> function (done in the caller; involves GOT access) and clobbering R1
> and PR.
>
> This is still a draft design and the version in musl is subject to
> change at any time since it's not a public API/ABI, but I think it
> could turn into something useful to have on the gcc side with a
> -matomic-model=libfunc option or similar. Other ABI considerations
> for
> gcc use would be where to store the function pointer and how to
> initialize it. To be reasonably efficient with FDPIC the caller needs
> to be responsible for loading the function pointer (and it needs to
> always point to code, not a function descriptor) so that the callee
> does not need a GOT pointer passed in.

Obviously the ABI has been constructed around the J-core's cas.l
instruction. Do you have plans to add other atomic operations (like
arithmetic)? If not, then I'd suggest to name the atomic model
"libfunc-musl-cas".

Cheers,
Oleg
* Re: SH runtime switchable atomics - proposed design
From: Rich Felker @ 2016-01-21 1:22 UTC
To: Oleg Endo; +Cc: gcc, musl

On Thu, Jan 21, 2016 at 08:08:18AM +0900, Oleg Endo wrote:
> On Tue, 2016-01-19 at 15:28 -0500, Rich Felker wrote:
> > I've been working on the new version of runtime-selected SH atomics
> > for musl, and I think what I've got might be appropriate for GCC's
> > generated atomics too. I know Oleg was not very excited about doing
> > this on the gcc side from a cost/benefit perspective
>
> I am just not keen on making this the default atomic model for SH.
> If you have a system built around this atomic model and want to add it
> to GCC, please send in patches. Just a few comments below...

OK, thanks for clarifying. I don't have a patch yet but I might do one
later. Sato-san's work on adding direct cas.l support showed me how
this part of the gcc code seems to work, so it shouldn't be too hard
to hook it up, but there are ABI design considerations still if we
decide to go this way.

> > Inputs:
> > - R0: Memory address to operate on
> > - R1: Address of implementation function, loaded from a global
> > - R2: Comparison value
> > - R3: Value to set on success
> >
> > Outputs:
> > - R3: Old value read, ==R2 iff cas succeeded.
> >
> > Preserved: R0, R2.
> >
> > Clobbered: R1, PR, T.
>
> The T bit is obviously the result of the cas operation. So you could
> use it as an output directly instead of the implicit R3 == R2
> condition.

I didn't want to impose a requirement that all backends leave the
result in the T bit. At the C source level, I think most software uses
old==expected as the test for success; this is the API
__sync_val_compare_and_swap provides, and what people used to x86
would naturally do anyway.

> > This call (performed from __asm__ for musl, but gcc would do it as SH
> > "SFUNC") is highly compact/convenient for inlining because it avoids
> > clobbering any of the argument registers that are likely to already
> > be
> > in use by the caller, and it preserves the important values that are
> > likely to be reused after the cas operation.
> >
> > For J2 and future J4, the function pointer just points to:
> >
> >         rts
> >         cas.l r2,r3,@r0
> >
> > and the only costs vs an inline cas.l are loading the address of the
> > function (done in the caller; involves GOT access) and clobbering R1
> > and PR.
> >
> > This is still a draft design and the version in musl is subject to
> > change at any time since it's not a public API/ABI, but I think it
> > could turn into something useful to have on the gcc side with a
> > -matomic-model=libfunc option or similar. Other ABI considerations
> > for
> > gcc use would be where to store the function pointer and how to
> > initialize it. To be reasonably efficient with FDPIC the caller needs
> > to be responsible for loading the function pointer (and it needs to
> > always point to code, not a function descriptor) so that the callee
> > does not need a GOT pointer passed in.
>
> Obviously the ABI has been constructed around the J-core's cas.l
> instruction.

Yes, but that was a choice I made after a first draft that was no more
optimal for the other backends and less optimal for J-core. And the
only real choices that were based on the instruction's properties were
using r0 for the address input and swapping the old value into r3
rather than producing it in a different register. Other than these
minor details ABI was guided more by avoiding clobbers/reloads of
potentially valuable data in the caller.

One possible change I just thought of: with one extra instruction in
the J-core version we could have the result come out in r1 and
preserve r3. Similar changes to the other versions are probably easy.

> Do you have plans to add other atomic operations (like
> arithmetic)?

No, at least not in musl. From musl's perspective cas is the main one
that's used anyway. But even in general I don't think there's a
significant advantage to doing 'direct' arithmetic ops without a cas
loop even when you can (with llsc, gusa, or imask model). With gusa
and imask the only time you benefit from not implementing them in
terms of cas is on the _highly_ unlucky/unlikely occasion where an
interrupt occurs between the old-value read before cas and the cas.
For llsc there's more potential advantage because actual smp
contention is possible, but sh4a is probably not a very interesting
target anymore.

> If not, then I'd suggest to name the atomic model
> "libfunc-musl-cas".

I'm not sure how the "musl" naming here makes sense unless you're
thinking of having it just call into musl's definitions, which is
certainly a possible design but not what I had in mind. I was thinking
of adapting the design to gcc and providing something similar via
libgcc.a.

Rich
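
The cas-loop approach to arithmetic discussed in the message above can be sketched in C as follows. This is an illustration under assumptions, not code from musl or GCC: the function name `fetch_add_via_cas` is hypothetical, and GCC's `__sync_val_compare_and_swap` builtin stands in for the runtime-selected cas routine. The comment marks the window between the old-value read and the cas in which an interrupt (or, on SMP, another CPU) can force a retry.

```c
/* Fetch-and-add built from compare-and-swap. Returns the value that
 * was in *addr before the addition. Unsigned arithmetic is used so
 * that wraparound is well defined. */
static unsigned fetch_add_via_cas(volatile unsigned *addr, unsigned delta)
{
    unsigned old, seen;
    do {
        old = *addr;    /* old-value read before the cas */
        /* If *addr changes here, the cas below fails and we loop. */
        seen = __sync_val_compare_and_swap(addr, old, old + delta);
    } while (seen != old);  /* old == seen iff the cas succeeded */
    return old;
}
```

On an uncontended word the loop body runs exactly once, which is the common case the message argues makes a dedicated arithmetic primitive hard to justify for the gusa and imask models.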
* Re: SH runtime switchable atomics - proposed design
From: Torvald Riegel @ 2016-01-21 11:22 UTC
To: Rich Felker; +Cc: Oleg Endo, gcc, musl

On Wed, 2016-01-20 at 20:22 -0500, Rich Felker wrote:
> On Thu, Jan 21, 2016 at 08:08:18AM +0900, Oleg Endo wrote:
> > Do you have plans to add other atomic operations (like
> > arithmetic)?
>
> No, at least not in musl. From musl's perspective cas is the main one
> that's used anyway. But even in general I don't think there's a
> significant advantage to doing 'direct' arithmetic ops without a cas
> loop even when you can (with llsc, gusa, or imask model). With gusa
> and imask the only time you benefit from not implementing them in
> terms of cas is on the _highly_ unlucky/unlikely occasion where an
> interrupt occurs between the old-value read before cas and the cas.
> For llsc there's more potential advantage because actual smp
> contention is possible, but sh4a is probably not a very interesting
> target anymore.

Things like atomic increments are combinable, so you can make them
scale better. You can do combining too with CAS, but if you're using
the CAS to implement an increment, the program would have to issue
another CAS, so you're not gaining as much.
* Re: SH runtime switchable atomics - proposed design
From: Oleg Endo @ 2016-01-21 11:32 UTC
To: Rich Felker; +Cc: gcc, musl

On Wed, 2016-01-20 at 20:22 -0500, Rich Felker wrote:
> On Thu, Jan 21, 2016 at 08:08:18AM +0900, Oleg Endo wrote:
> > On Tue, 2016-01-19 at 15:28 -0500, Rich Felker wrote:
> > > I've been working on the new version of runtime-selected SH
> > > atomics
> > > for musl, and I think what I've got might be appropriate for
> > > GCC's
> > > generated atomics too. I know Oleg was not very excited about
> > > doing
> > > this on the gcc side from a cost/benefit perspective
> >
> > I am just not keen on making this the default atomic model for SH.
> > If you have a system built around this atomic model and want to add
> > it
> > to GCC, please send in patches. Just a few comments below...
>
> OK, thanks for clarifying. I don't have a patch yet but I might do
> one
> later. Sato-san's work on adding direct cas.l support showed me how
> this part of the gcc code seems to work, so it shouldn't be too hard
> to hook it up, but there are ABI design considerations still if we
> decide to go this way.
>
> > > Inputs:
> > > - R0: Memory address to operate on
> > > - R1: Address of implementation function, loaded from a global
> > > - R2: Comparison value
> > > - R3: Value to set on success
> > >
> > > Outputs:
> > > - R3: Old value read, ==R2 iff cas succeeded.
> > >
> > > Preserved: R0, R2.
> > >
> > > Clobbered: R1, PR, T.
> >
> > The T bit is obviously the result of the cas operation. So you
> > could
> > use it as an output directly instead of the implicit R3 == R2
> > condition.
>
> I didn't want to impose a requirement that all backends leave the
> result in the T bit. At the C source level, I think most software
> uses
> old==expected as the test for success; this is the API
> __sync_val_compare_and_swap provides, and what people used to x86
> would naturally do anyway.
>
> > > This call (performed from __asm__ for musl, but gcc would do it
> > > as SH
> > > "SFUNC") is highly compact/convenient for inlining because it
> > > avoids
> > > clobbering any of the argument registers that are likely to
> > > already
> > > be
> > > in use by the caller, and it preserves the important values that
> > > are
> > > likely to be reused after the cas operation.
> > >
> > > For J2 and future J4, the function pointer just points to:
> > >
> > >         rts
> > >         cas.l r2,r3,@r0
> > >
> > > and the only costs vs an inline cas.l are loading the address of
> > > the
> > > function (done in the caller; involves GOT access) and clobbering
> > > R1
> > > and PR.
> > >
> > > This is still a draft design and the version in musl is subject
> > > to
> > > change at any time since it's not a public API/ABI, but I think
> > > it
> > > could turn into something useful to have on the gcc side with a
> > > -matomic-model=libfunc option or similar. Other ABI
> > > considerations
> > > for
> > > gcc use would be where to store the function pointer and how to
> > > initialize it. To be reasonably efficient with FDPIC the caller
> > > needs
> > > to be responsible for loading the function pointer (and it needs
> > > to
> > > always point to code, not a function descriptor) so that the
> > > callee
> > > does not need a GOT pointer passed in.
> >
> > Obviously the ABI has been constructed around the J-core's cas.l
> > instruction.
>
> Yes, but that was a choice I made after a first draft that was no
> more
> optimal for the other backends and less optimal for J-core. And the
> only real choices that were based on the instruction's properties
> were
> using r0 for the address input and swapping the old value into r3
> rather than producing it in a different register. Other than these
> minor details ABI was guided more by avoiding clobbers/reloads of
> potentially valuable data in the caller.
>
> One possible change I just thought of: with one extra instruction in
> the J-core version we could have the result come out in r1 and
> preserve r3. Similar changes to the other versions are probably easy.
>
> > Do you have plans to add other atomic operations (like
> > arithmetic)?
>
> No, at least not in musl. From musl's perspective cas is the main one
> that's used anyway. But even in general I don't think there's a
> significant advantage to doing 'direct' arithmetic ops without a cas
> loop even when you can (with llsc, gusa, or imask model). With gusa
> and imask the only time you benefit from not implementing them in
> terms of cas is on the _highly_ unlucky/unlikely occasion where an
> interrupt occurs between the old-value read before cas and the cas.
> For llsc there's more potential advantage because actual smp
> contention is possible, but sh4a is probably not a very interesting
> target anymore.
>
> > If not, then I'd suggest to name the atomic model
> > "libfunc-musl-cas".
>
> I'm not sure how the "musl" naming here makes sense unless you're
> thinking of having it just call into musl's definitions, which is
> certainly a possible design but not what I had in mind. I was
> thinking
> of adapting the design to gcc and providing something similar via
> libgcc.a.
>

I think it will be easier to discuss this with a patch at hand. Right
now I can't really imagine what exactly you want to put where. The
exact asm code of the sequences/functions and the ABI is up to you.
You will know what works best for your system. Of course we can add
the resulting necessary support functions to libgcc (which get
compiled-in only if the compiler is configured to use the new atomic
model).

Cheers,
Oleg