mailing list of musl libc
* TLSDESC register-preserving mess
From: Rich Felker @ 2018-10-10  1:26 UTC
  To: musl

I've run across a bit of a problem in how the TLSDESC calling
conventions work. In the case where the needed DTV slot is not yet
filled in for the calling thread, the dynamic TLSDESC function needs
to call into C code that obtains the memory that was previously
reserved for it, initializes it (involving memcpy/memset), and fills
in the DTV entry for it. This requires saving and restoring any
call-clobbered registers that might be used by C code.
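
A minimal C model of the lazy path described above, as a sketch only.
The struct layouts and the names tls_module, thread, tls_get_new_model
and tlsdesc_dynamic_model are hypothetical, and malloc stands in for
"the memory that was previously reserved"; none of this is musl's
actual code. It shows why the slow path is the problem: it is ordinary
C that calls memcpy/memset and so may touch any call-clobbered
register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of one TLS module: an initialization image
 * followed by a zero-filled tail. */
struct tls_module {
	const void *image;  /* initialized data to copy */
	size_t len;         /* bytes of initialized data */
	size_t size;        /* total size, including the zeroed tail */
};

/* Hypothetical per-thread state: dtv[i] stays NULL until this thread
 * first touches module i. */
struct thread {
	void **dtv;
	size_t dtv_len;
};

/* Slow path: plain C. The compiler and the memcpy/memset
 * implementations are free to use vector registers here.
 * malloc is a stand-in (assumption) for the reserved memory. */
static void *tls_get_new_model(struct thread *self,
	const struct tls_module *mod, size_t modid)
{
	unsigned char *mem = malloc(mod->size);
	assert(mem);
	memcpy(mem, mod->image, mod->len);
	memset(mem + mod->len, 0, mod->size - mod->len);
	self->dtv[modid] = mem;
	return mem;
}

/* Fast path plus the branch into the slow path. In the real asm the
 * caller expects nearly all registers to survive this call. */
void *tlsdesc_dynamic_model(struct thread *self,
	const struct tls_module *mod, size_t modid, size_t offset)
{
	assert(modid < self->dtv_len);
	void *base = self->dtv[modid];
	if (!base) base = tls_get_new_model(self, mod, modid);
	return (char *)base + offset;
}
```

The first access from a thread takes the slow path and fills the slot;
later accesses return straight from the DTV.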

Because the operation involves memcpy/memset, it's not just
theoretically possible but likely that vector registers could be used.
As written, the aarch64 and arm asm save and restore float/vector
registers around the call, but I don't think they're future-proof
against ISA extensions that add more such registers; if libc were
built to use such a future ISA level, the asm we have now would be
unsafe. The i386 and x86_64 tlsdesc asm do not presently do anything
to save float/vector registers, and doing so would involve lots of
hwcap mess to figure out which ones are present. I think it would also
fail to be future-proof. Fortunately, i386 and x86_64 both provide
non-vector asm implementations of memcpy and memset, making it less
likely that any vector registers would be used in these code paths,
but still not impossible. It's also a hidden constraint: things work
only because of details of the asm implementations.

Unfortunately making a future-proof solution is really hard; this is a
consequence of the TLSDESC ABI and the way register file extensions
get done by cpu vendors.

One approach would be generating a fully-flattened version of
__tls_get_new for each arch that uses TLSDESC, via gcc -S, and
committing the output into the project as a source file.
Unfortunately, this involves atomics whose definitions vary by ISA
level on arm, so I think that makes it a no-go. Obviously it's also
really ugly.

Another approach is to depend on the compiler having flags that can be
used to build for a profile that only allows GPRs (no vector regs,
etc.), and building __tls_get_new as its own source file using these
flags. This is not the sort of tooling requirement I like, since it
abandons the principle of working with an arbitrary compiler with
minimal GNU C features.

The only approach I know that doesn't involve any tooling is having
the dynamic TLSDESC function raise a signal when it's missing the DTV
slot it needs. This delegates the responsibility for awareness of what
registers need saving to the kernel, which already must be aware in
order to perform context switching (you inherently can't run a binary
that uses new registers on an old kernel that's not aware of them).
This approach is nice in that it's entirely arch-agnostic, and works
for all present and future archs and ISA/register-file extensions. The
easy approach would just nab another SIGRTx as an
implementation-internal signal, so that all the asm would need to do
is a tkill syscall. Multiplexing on another signal should be possible
but makes for more complexity and I'm not sure there's any real
benefit.
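
A rough sketch of the signal idea in portable C, under stated
assumptions: SIGUSR1 stands in for the implementation-internal SIGRTx,
raise() stands in for the tkill syscall the asm would make, and
SIG_TLS_FILL, fill_handler, install_fill_handler and get_slot_model
are hypothetical names. The point it illustrates is that the kernel
saves and restores the full register state around the handler, so the
code that raises the signal needs no knowledge of the register file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the internal signal (assumption; a real
 * implementation would reserve a SIGRTx for this). */
#define SIG_TLS_FILL SIGUSR1

/* Hypothetical single DTV slot and its backing storage. */
static unsigned char tls_block[64];
static void *volatile dtv_slot;

/* Runs on a kernel-built signal frame; whatever registers it uses
 * are restored by the kernel on return. */
static void fill_handler(int sig)
{
	(void)sig;
	memset(tls_block, 0, sizeof tls_block);
	dtv_slot = tls_block;
}

/* Install the handler; returns 0 on success, -1 on error. */
int install_fill_handler(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = fill_handler;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIG_TLS_FILL, &sa, 0);
}

/* Model of the dynamic TLSDESC function: if the slot is missing,
 * signal ourselves and let the handler fill it in. */
void *get_slot_model(void)
{
	if (!dtv_slot) {
		raise(SIG_TLS_FILL);
		assert(dtv_slot);
	}
	return dtv_slot;
}
```

This relies on the signal not being blocked in the calling thread, so
that the handler has run by the time raise() returns.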

My leaning is to go with the signal solution.

Rich



* Re: TLSDESC register-preserving mess
From: Rich Felker @ 2018-10-10  2:35 UTC
  To: musl

On Tue, Oct 09, 2018 at 09:26:20PM -0400, Rich Felker wrote:
> My leaning is to go with the signal solution.

An alternate approach being proposed on #musl that I might like better
is getting rid of __tls_get_new entirely, having the DTV for all
existing threads updated at dlopen time. This requires either a
__synccall with no failure path (which we don't have) or adding a
linked list of threads. The non-__synccall approach also requires the
SYS_membarrier syscall (Linux 4.3) and emulation of it as a fallback
(which can be done via signals if you have a list of threads).

Aside from solving the tlsdesc clobber issue, what I like about this
approach is that it removes all branches from __tls_get_addr and the
dynamic tlsdesc function; they just *always succeed in the hot path*.
It also makes it easier to facilitate recovery of memory allocated for
dynamic TLS if we want to -- it no longer has to be a shared block
doled out to threads via a_fetch_add, so each thread could get its own
malloc and then be able to free it at exit.
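
A single-threaded C model of this alternative, as a sketch under
assumptions: thread_model, tls_index_model, install_module_model and
tls_get_addr_model are hypothetical names, and the model ignores the
synchronization (__synccall or membarrier) that a real implementation
needs. It walks a linked list of threads at "dlopen time", gives each
thread its own block, and leaves a lookup with no branch in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct thread_model {
	struct thread_model *next;
	void **dtv;       /* dtv[i] is this thread's block for module i */
	size_t dtv_len;
};

/* Same shape as the usual (module, offset) argument pair. */
struct tls_index_model { size_t module, offset; };

struct pending { void **dtv; void *block; };

/* Install module modid in every thread on the list.
 * Returns 0 on success, -1 if allocation fails (no thread changed). */
int install_module_model(struct thread_model *head, size_t modid,
	const void *image, size_t len, size_t size)
{
	size_t n = 0, i;
	struct thread_model *t;
	struct pending *p;

	assert(len <= size);
	for (t = head; t; t = t->next) n++;
	p = calloc(n ? n : 1, sizeof *p);
	if (!p) return -1;

	/* Pass 1: allocate everything first, so a failure leaves all
	 * threads untouched. */
	for (t = head, i = 0; t; t = t->next, i++) {
		size_t nl = t->dtv_len > modid ? t->dtv_len : modid + 1;
		p[i].dtv = calloc(nl, sizeof *p[i].dtv);
		p[i].block = calloc(1, size ? size : 1);
		if (!p[i].dtv || !p[i].block) goto fail;
		if (t->dtv_len)
			memcpy(p[i].dtv, t->dtv, t->dtv_len * sizeof *t->dtv);
		if (len) memcpy(p[i].block, image, len);
		p[i].dtv[modid] = p[i].block;
	}

	/* Pass 2: commit; nothing here can fail. A real implementation
	 * could not free the old array while its owner might still be
	 * reading it; this single-threaded model can. */
	for (t = head, i = 0; t; t = t->next, i++) {
		size_t nl = t->dtv_len > modid ? t->dtv_len : modid + 1;
		free(t->dtv);
		t->dtv = p[i].dtv;
		t->dtv_len = nl;
	}
	free(p);
	return 0;

fail:
	for (i = 0; i < n; i++) { free(p[i].dtv); free(p[i].block); }
	free(p);
	return -1;
}

/* Hot path: every slot is already filled, so no branch is needed. */
void *tls_get_addr_model(void *const *dtv,
	const struct tls_index_model *ti)
{
	return (char *)dtv[ti->module] + ti->offset;
}
```

Because each thread owns its block outright, freeing it at thread exit
would be a plain free() of the corresponding dtv entry.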

Rich



* Re: TLSDESC register-preserving mess
From: Szabolcs Nagy @ 2018-10-10 13:19 UTC
  To: musl

* Rich Felker <dalias@libc.org> [2018-10-09 21:26:20 -0400]:
> As written, the aarch64 and arm asm save and restore float/vector
> registers around the call, but I don't think they're future-proof
> against ISA extensions that add more such registers; if libc were

at least on aarch64 for now the approach is to add new vector
registers to the tlsdesc clobber list in gcc (and document
this in the sysv abi, except that's not published yet).

the reasoning is that it makes it safe to use tlsdesc with
old dynamic linker (new vector registers overlap with old
ones so old dynamic linker can clobber them) without much
practical cost: it's unlikely that vector code needs to
access tls (in vectorized loops the address is hopefully
computed outside the loop and vector math code should not
use tls state in the fast path if it wants to be efficient)
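
a small illustration of the hoisting point, as a sketch only: the
names scale, set_scale and scale_all are made up for the example. the
tls access happens once, before the loop, so a tlsdesc call that
clobbers vector registers never sits between vector instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread-local parameter used by a vectorizable loop. */
static _Thread_local double scale = 1.0;

void set_scale(double x)
{
	scale = x;
}

/* Multiply every element of v by the calling thread's scale. */
void scale_all(double *v, size_t n)
{
	assert(v || !n);
	/* Resolve the TLS address and load the value once, outside the
	 * loop; the loop body then touches no TLS state. */
	const double *sp = &scale;
	const double s = *sp;
	for (size_t i = 0; i < n; i++)
		v[i] *= s;
}
```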



* Re: TLSDESC register-preserving mess
From: Rich Felker @ 2018-10-10 13:52 UTC
  To: musl

On Wed, Oct 10, 2018 at 03:19:26PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2018-10-09 21:26:20 -0400]:
> > As written, the aarch64 and arm asm save and restore float/vector
> > registers around the call, but I don't think they're future-proof
> > against ISA extensions that add more such registers; if libc were
> 
> at least on aarch64 for now the approach is to add new vector
> registers to the tlsdesc clobber list in gcc (and document
> this in the sysv abi, except that's not published yet).
> 
> the reasoning is that it makes it safe to use tlsdesc with
> old dynamic linker (new vector registers overlap with old
> ones so old dynamic linker can clobber them) without much
> practical cost: it's unlikely that vector code needs to
> access tls (in vectorized loops the address is hopefully
> computed outside the loop and vector math code should not
> use tls state in the fast path if it wants to be efficient)

Any idea if other archs are willing to commit to the same?

Even if they are, the second idea of getting rid of __tls_get_new
entirely is still somewhat appealing, in that it makes all dynamic TLS
access faster and reduces the amount of asm needed. But a commitment
not to add new call-saved registers to the TLSDESC ABIs would solve
the immediate problem (albeit with some hwcap fiddling for 32-bit x86
where mmx, sse, etc. perhaps need to be saved conditionally).

Rich

