mailing list of musl libc
* Does TD point to itself intentionally?
@ 2019-03-30 10:38 Markus Wichmann
  2019-03-30 11:11 ` Frediano Ziglio
  2019-03-30 14:39 ` Rich Felker
  0 siblings, 2 replies; 6+ messages in thread
From: Markus Wichmann @ 2019-03-30 10:38 UTC (permalink / raw)
  To: musl

Hi all,

I was looking over my old C experiments and saw an old file, trying to
use clang's address_space attribute to access something like a thread
pointer. That made me wonder how it is implemented in musl.

In most architectures, the thread pointer is just stored in a register,
and __pthread_self() will just grab it out of there. For x86_64,
something slightly similar happens: The thread pointer is stored in
FS.base, which is an MSR the kernel has to set for us, but we can read
it with FS-relative addressing.

Incidentally: Is there any interest in using the "wrfsbase" instruction
for that, where available? From a cursory first glance, it looks like
that would mean that musl would have to do the entire CPUID dance on
AMD64 and i386, and in the latter case the dance would be a bit longer
since the ID bit dance would have to precede it.

Back to setting the thread pointer: The relevant code is in __init_tp(),
which is always called with the return value from __copy_tls(), which
points to the new thread descriptor. __init_tp() will then call
__set_thread_area() with the adjusted thread pointer, and on AMD64, this
will just call arch_prctl(SET_FS, p). Though I don't know why that
function has to be in assembly.

OK, got it. After this, FS.base will point directly at the TD, so we can
just load FS.base into any register and have a thread pointer, right?
Enter __pthread_self():

static inline struct pthread *__pthread_self()
{
	struct pthread *self;
	__asm__ ("mov %%fs:0,%0" : "=r" (self) );
	return self;
}

But that is not the same thing! This will load FS.base, and then
dereference it and load the qword it is pointing at into a register. So
how did this ever work? Well, the answer is back in __init_tp():

	td->self = td;

And of course, "self" is the first member of struct pthread.
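
To make the self-pointer concrete, here is a minimal sketch (not musl code) that compares the qword stored at %fs:0 with FS.base as reported by the kernel. It assumes x86_64 Linux; the function name fs0_holds_fs_base() is made up for illustration, and the query uses the raw arch_prctl syscall with ARCH_GET_FS, the read counterpart of the "SET_FS" call mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() in strict modes */
#endif
#include <assert.h>
#include <stdint.h>

#if defined(__linux__) && defined(__x86_64__)
#include <unistd.h>
#include <sys/syscall.h>        /* SYS_arch_prctl */
#include <asm/prctl.h>          /* ARCH_GET_FS */
#endif

/*
 * Hypothetical helper: returns 1 if the qword stored at %fs:0 equals
 * FS.base, 0 if it does not, and -1 if the check could not be run
 * (not x86_64 Linux, or the arch_prctl query failed).
 */
int fs0_holds_fs_base(void)
{
#if defined(__linux__) && defined(__x86_64__)
	unsigned long base = 0;
	uintptr_t self;

	/* Ask the kernel for FS.base directly. */
	if (syscall(SYS_arch_prctl, ARCH_GET_FS, &base) != 0)
		return -1;

	/* Load the word at %fs:0, the same way __pthread_self() does. */
	__asm__ ("mov %%fs:0,%0" : "=r" (self));

	return self == (uintptr_t)base;
#else
	return -1;
#endif
}
```

If td->self = td has been done for the calling thread, the two values should agree and the helper returns 1.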

So, now the question I've been building up to: Is that intentional? Is
there a reason for there to be a pointer pointing to itself, other than
the "mov" in __pthread_self()? Could that mov not be replaced with a
"lea" and save one useless memory access?

Ciao,
Markus



* Re: Does TD point to itself intentionally?
  2019-03-30 10:38 Does TD point to itself intentionally? Markus Wichmann
@ 2019-03-30 11:11 ` Frediano Ziglio
  2019-03-30 12:57   ` Markus Wichmann
  2019-03-30 14:39 ` Rich Felker
  1 sibling, 1 reply; 6+ messages in thread
From: Frediano Ziglio @ 2019-03-30 11:11 UTC (permalink / raw)
  To: musl


But "lea" how? It would have to be an rdfsbase instruction, since the
"standard" registers are used for other purposes. But as you said, you cannot
assume rdfsbase will work, so it is hard to inline. Done the current way, that
single assembly instruction can be inlined easily.

Frediano



* Re: Does TD point to itself intentionally?
  2019-03-30 11:11 ` Frediano Ziglio
@ 2019-03-30 12:57   ` Markus Wichmann
  2019-03-30 13:18     ` Frediano Ziglio
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Wichmann @ 2019-03-30 12:57 UTC (permalink / raw)
  To: musl

On Sat, Mar 30, 2019 at 07:11:41AM -0400, Frediano Ziglio wrote:
> But "lea" how? It would have to be an rdfsbase instruction, since the
> "standard" registers are used for other purposes. But as you said, you cannot
> assume rdfsbase will work, so it is hard to inline. Done the current way, that
> single assembly instruction can be inlined easily.
>
> Frediano

I don't understand the objection. I was talking about replacing
__pthread_self() with:

asm ("lea %%fs:0, %0" : "=r"(self));

In case you are unfamiliar with that instruction: If the %0 were
replaced with %rax, this would assemble to the opcode:

64 48 8d 04 25 00 00 00 00

My god... having written this down, it would apparently be cheaper (code
size wise) to encode

xorl %eax,%eax
leaq %fs:(%rax),%rax

Because in 64-bit mode you need a SIB byte to encode absolute addresses,
and the SIB byte in this mode only does 32-bit displacements. Let's see...

31 C0
64 48 8d 00

Yep. 9 bytes vs. 6 bytes. But now I'm micro-optimizing. Though this
optimization would also be valid for the current implementation.
Something like:

static inline struct pthread *__pthread_self()
{
#ifdef MY_PATCH
#define INST "lea"
#else
#define INST "mov"
#endif
	/* Zero the register first so it can serve as the base in %fs:(%0). */
	struct pthread *self = 0;
	__asm__ (INST " %%fs:(%0),%0" : "+r" (self) );
	return self;
}

My question was more about removing this conceptual hurdle, and making
it more clear that FS indeed points to the thread descriptor, and not a
pointer to the thread descriptor. I know full well we can't remove
"self", nor skip the initialization, since both of these are ABI.

Ciao,
Markus



* Re: Does TD point to itself intentionally?
  2019-03-30 12:57   ` Markus Wichmann
@ 2019-03-30 13:18     ` Frediano Ziglio
  0 siblings, 0 replies; 6+ messages in thread
From: Frediano Ziglio @ 2019-03-30 13:18 UTC (permalink / raw)
  To: musl

> My god... having written this down, it would apparently be cheaper (code
> size wise) to encode
> 
> xorl %eax,%eax
> leaq %fs:(%rax),%rax
> 

The segment base is not taken into account; this will produce 0.
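
A small sketch that shows this on x86_64. It emits the 4-byte encoding of `leaq %fs:(%rax),%rax` as raw bytes (0x64 FS override, 0x48 REX.W, 0x8d LEA, 0x00 ModRM), so no assembler diagnostics about the prefix get in the way. The function name lea_with_fs_prefix() is made up for illustration; on other architectures it just returns its argument.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: run `leaq %fs:(%rax),%rax` with RAX preloaded
 * to `offset` and return the result. LEA only computes the effective
 * address, so the FS prefix has no influence on the value.
 */
uint64_t lea_with_fs_prefix(uint64_t offset)
{
#if defined(__x86_64__)
	/* 64 = FS override, 48 = REX.W, 8d = LEA, 00 = ModRM (%rax),%rax */
	__asm__ (".byte 0x64, 0x48, 0x8d, 0x00" : "+a" (offset));
#endif
	return offset;
}
```

With RAX zeroed beforehand the result is 0, not FS.base; with any other input the result is simply that input.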


Frediano



* Re: Does TD point to itself intentionally?
  2019-03-30 10:38 Does TD point to itself intentionally? Markus Wichmann
  2019-03-30 11:11 ` Frediano Ziglio
@ 2019-03-30 14:39 ` Rich Felker
  2019-03-30 16:36   ` Markus Wichmann
  1 sibling, 1 reply; 6+ messages in thread
From: Rich Felker @ 2019-03-30 14:39 UTC (permalink / raw)
  To: musl

On Sat, Mar 30, 2019 at 11:38:14AM +0100, Markus Wichmann wrote:
> Hi all,
> 
> I was looking over my old C experiments and saw an old file, trying to
> use clang's address_space attribute to access something like a thread
> pointer. That made me wonder how it is implemented in musl.

I've experimented with using the equivalent in GCC to get musl to
generate %gs:offset or %fs:offset for access to fields in the thread
structure. Unfortunately you need -fasm or they silently don't work --
see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87626 for details. It
does help code generation somewhat and gave measurable performance
benefits in microbenchmarks (mainly due to reducing register
pressure), but would require making separate __self() or something
that returns the address-spaced pointer whose value is not valid for
assignment to pointers or passing as an argument like __pthread_self()
needs to be. Also, experiments showed that GCC generated multiple
instances of __self() on archs where the asm to load the thread
pointer was actually more expensive than caching the result in a
register. This was able to be partly mitigated by adding some \n\n\n
to the asm... *facepalm*

> In most architectures, the thread pointer is just stored in a register,
> and __pthread_self() will just grab it out of there. For x86_64,
> something slightly similar happens: The thread pointer is stored in
> FS.base, which is an MSR the kernel has to set for us, but we can read
> it with FS-relative addressing.
> 
> Incidentally: Is there any interest in using the "wrfsbase" instruction
> for that, where available? From a cursory first glance, it looks like
> that would mean that musl would have to do the entire CPUID dance on
> AMD64 and i386, and in the latter case the dance would be a bit longer
> since the ID bit dance would have to precede it.

No. Even a single insn to test the stored result of whether such a
feature is available (in practice it would take several and a branch)
is more expensive than loading from %fs:0. And even without having to
make a runtime test, it should be the same cost, possibly still more
expensive, than loading from %fs:0.

> Back to setting the thread pointer: The relevant code is in __init_tp(),
> which is always called with the return value from __copy_tls(), which
> points to the new thread descriptor. __init_tp() will then call
> __set_thread_area() with the adjusted thread pointer, and on AMD64, this
> will just call arch_prctl(SET_FS, p). Though I don't know why that
> function has to be in assembly.
> 
> OK, got it. After this, FS.base will point directly at the TD, so we can
> just load FS.base into any register and have a thread pointer, right?
> Enter __pthread_self():
> 
> static inline struct pthread *__pthread_self()
> {
> 	struct pthread *self;
> 	__asm__ ("mov %%fs:0,%0" : "=r" (self) );
> 	return self;
> }
> 
> But that is not the same thing! This will load FS.base, and then
> dereference it and load the qword it is pointing at into a register. So
> how did this ever work? Well, the answer is back in __init_tp():
> 
> 	td->self = td;
> 
> And of course, "self" is the first member of struct pthread.
> 
> So, now the question I've been building up to: Is that intentional? Is

Yes, this is intentional. It's the documented ABI for x86[_64], and
necessary for the operation of code generated by a compiler
conforming to the ABI that takes &tlsvar via the initial-exec or
local-exec model.
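
As an illustration of that address computation (a sketch assuming x86_64 Linux, not compiler output): for local-exec, a compiler typically loads the thread pointer from %fs:0 and adds a link-time constant offset to it. The variable tlsvar and the function tlsvar_offset_from_tp() are made up for this example; the function returns 0 on other platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TLS variable, only here to have an address to take. */
static __thread int tlsvar;

/*
 * Hypothetical helper: distance from the thread pointer to tlsvar,
 * i.e. the offset that gets added to the value loaded from %fs:0.
 */
intptr_t tlsvar_offset_from_tp(void)
{
	uintptr_t addr = (uintptr_t)&tlsvar;
#if defined(__linux__) && defined(__x86_64__)
	uintptr_t tp;

	/* The thread pointer as an ordinary pointer value. */
	__asm__ ("mov %%fs:0,%0" : "=r" (tp));
	return (intptr_t)(addr - tp);
#else
	(void)addr;
	return 0;
#endif
}
```

For a variable in the main executable's static TLS block the offset should come out negative on x86_64, since that block sits below the thread pointer. The point is that &tlsvar needs the thread pointer as a plain address, and the load from %fs:0 is how generated code gets it.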

> there a reason for there to be a pointer pointing to itself, other than
> the "mov" in __pthread_self()? Could that mov not be replaced with a
> "lea" and save one useless memory access?

The effective address computed by lea would be relative to %fs or %gs.
It's not useful.

Rich



* Re: Does TD point to itself intentionally?
  2019-03-30 14:39 ` Rich Felker
@ 2019-03-30 16:36   ` Markus Wichmann
  0 siblings, 0 replies; 6+ messages in thread
From: Markus Wichmann @ 2019-03-30 16:36 UTC (permalink / raw)
  To: musl

On Sat, Mar 30, 2019 at 10:39:39AM -0400, Rich Felker wrote:
> This was able to be partly mitigated by adding some \n\n\n
> to the asm... *facepalm*
>

That is so GCC...

> No. Even a single insn to test the stored result of whether such a
> feature is available (in practice it would take several and a branch)
> is more expensive than loading from %fs:0. And even without having to
> make a runtime test, it should be the same cost, possibly still more
> expensive, than loading from %fs:0.
>

No, I meant, use wrfsbase instead of arch_prctl() in
__set_thread_area(). But as far as I can see, on AMD64 and i386, __hwcap
is just the EDX of CPUID function 1. But we'd need EBX bit 0 of CPUID
function 7, with ECX = 0.
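
For reference, a sketch of that check using the __get_cpuid_count() helper from the <cpuid.h> header shipped with GCC and clang. The function name cpu_has_fsgsbase() is made up. Note that this only reports what the CPU implements; the kernel also has to enable the instructions before userspace can use them, which this sketch does not check.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: returns 1 if CPUID function 7 (ECX = 0)
 * reports EBX bit 0 (FSGSBASE), 0 otherwise or on non-x86 targets.
 */
int cpu_has_fsgsbase(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* Fails (returns 0) if CPUID or leaf 7 is not available. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return ebx & 1;
#else
	return 0;
#endif
}
```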

> The effective address computed by lea would be relative to %fs or %gs.
> It's not useful.
>
> Rich

I just noticed that this fact is very well hidden in the documentation.
It is never spelled out, but the docs do say that LEA calculates the
effective address. And if you then open the AMD APM volume 1, and read
up on what an effective address is, which you have to do under the
heading "Memory Management", not "Effective Addresses", of course,
*then* you will find a nice graphic that tells you that the effective
address has not had segmentation applied yet. And it also suggests
that segmentation doesn't exist in 64-bit mode. Which is laughable,
considering what we are talking about right now.

So yeah, you do have to dig pretty deep to find that small potato.

Are the Intel docs any better? If so, I might have to switch.

Ciao,
Markus


