mailing list of musl libc
* TLS issue on aarch64
@ 2018-05-25 12:40 Phillip Berndt
  2018-05-25 14:50 ` Szabolcs Nagy
  0 siblings, 1 reply; 15+ messages in thread
From: Phillip Berndt @ 2018-05-25 12:40 UTC
  To: musl

Hi,

I'm experiencing a TLS-related error with musl on aarch64. This is my
test program:

----8<--------------
#include <stdio.h>

__thread int foo = 1234;
__thread int bar __attribute__((aligned(0x100))) = 5678;

int main(int argc, char **argv)
{
    printf("0x%p: %d\n", &foo, foo);
    printf("0x%p: %d\n", &bar, bar);

    return 0;
}
---->8---------------

I'm compiling this into a static binary with -O2 -static. With glibc and
gcc 5.4.0 (tried with 7.2 as well), this gives me the expected output
(&bar aligned to 0x100, and the values 1234 and 5678). With musl libc
1.1.19, I instead see

----8<--------------
0x0x7fa7adf2f0: 0
0x0x7fa7adf3f0: 0
---->8---------------

Note that this is the wrong address (not aligned), and that the memory has
unexpected content as well.

I did some initial debugging, but now I'm stuck and need some help. What I've
found so far:

* GCC apparently emits code that expects the tpidr_el0 register to contain a
  pointer to the TLS memory, and it expects that the loader unconditionally
  offsets the first variable by the TLS alignment into said memory:

  Disassembly of the code that loads &foo:
  ----8<--------------
  4001a4:       d53bd053        mrs     x19, tpidr_el0
  4001a8:       91400273        add     x19, x19, #0x0, lsl #12
  4001ac:       91040273        add     x19, x19, #0x100
  ---->8---------------

  (If I align the variable to 0x1000 instead, the code changes
   accordingly.)

* Musl, on the other hand, in __copy_tls, initializes tpidr_el0 with a
  pointer 16 bytes before the end of struct pthread, and copies the TLS
  initialization image directly after that struct, without adding extra
  padding.

Hence the code tries to access the TLS variables at the wrong location.
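A self-checking variant of the same test (just a sketch) aborts at exactly
this mismatch:

----8<--------------
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

__thread int bar __attribute__((aligned(0x100))) = 5678;

int main(void)
{
	printf("%p: %d\n", (void *)&bar, bar);
	/* both checks fail on musl 1.1.19/aarch64, per the output above */
	assert(((uintptr_t)&bar & 0xff) == 0);
	assert(bar == 5678);
	return 0;
}
---->8---------------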

The following patch fixes the issue, but only if musl is then compiled with
optimizations off. With optimizations, the compiler emits the *same* code for
both variants. Also, the patch probably has some unexpected side-effects, too -
I'm just adding it here as a starting point for further debugging.

Any help is greatly appreciated :-)

- Phillip

----

diff --git a/arch/aarch64/pthread_arch.h b/arch/aarch64/pthread_arch.h
index b2e2d8f..c69f6f1 100644
--- a/arch/aarch64/pthread_arch.h
+++ b/arch/aarch64/pthread_arch.h
@@ -2,10 +2,10 @@ static inline struct pthread *__pthread_self()
 {
        char *self;
        __asm__ __volatile__ ("mrs %0,tpidr_el0" : "=r"(self));
-       return (void*)(self + 16 - sizeof(struct pthread));
+       return (void*)(self - sizeof(struct pthread));
 }

 #define TLS_ABOVE_TP
-#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread) - 16)
+#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread))

 #define MC_PC pc
diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
index b125eb1..3a3c307 100644
--- a/src/env/__init_tls.c
+++ b/src/env/__init_tls.c
@@ -42,7 +42,7 @@ void *__copy_tls(unsigned char *mem)

        mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
        td = (pthread_t)mem;
-       mem += sizeof(struct pthread);
+       mem += sizeof(struct pthread) + libc.tls_align;

        for (i=1, p=libc.tls_head; p; i++, p=p->next) {
                dtv[i] = mem + p->offset;



* Re: TLS issue on aarch64
  2018-05-25 12:40 TLS issue on aarch64 Phillip Berndt
@ 2018-05-25 14:50 ` Szabolcs Nagy
  2018-05-25 21:38   ` Szabolcs Nagy
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Szabolcs Nagy @ 2018-05-25 14:50 UTC
  To: musl

* Phillip Berndt <phillip.berndt@googlemail.com> [2018-05-25 14:40:14 +0200]:
> I'm experiencing a TLS-related error with musl on aarch64. This is my
> test program:
> 
> ----8<--------------
> #include <stdio.h>
> 
> __thread int foo = 1234;
> __thread int bar __attribute__((aligned(0x100))) = 5678;
> 
> int main(int argc, char **argv)
> {
>     printf("0x%p: %d\n", &foo, foo);
>     printf("0x%p: %d\n", &bar, bar);
> 
>     return 0;
> }
> ---->8---------------
> 
> I'm compiling this into a static binary with -O2 -static. With glibc and
> gcc 5.4.0 (tried with 7.2 as well), this gives me the expected output. With
> musl libc 1.1.19, I instead see
> 
> ----8<--------------
> 0x0x7fa7adf2f0: 0
> 0x0x7fa7adf3f0: 0
> ---->8---------------
> 
> Note that this is the wrong address (not aligned), and that the memory has
> unexpected content as well.
> 
> I did some initial debugging, but now I'm stuck and need some help. What I've
> found so far:
> 
> * GCC apparently emits code that expects the tpidr_el0 register to contain a
>   pointer to the TLS memory, and it expects that the loader unconditionally
>   offsets the first variable by the TLS alignment into said memory:
> 
>   Disassembly of the code that loads &foo:
>   ----8<--------------
>   4001a4:       d53bd053        mrs     x19, tpidr_el0
>   4001a8:       91400273        add     x19, x19, #0x0, lsl #12
>   4001ac:       91040273        add     x19, x19, #0x100
>   ---->8---------------
> 
>   (If I align the variable to 0x1000 instead, the code changes
>    accordingly.)
> 
> * Musl, on the other hand, in __copy_tls, initializes tpidr_el0 with a
>   pointer 16 bytes before the end of struct pthread, and copies the TLS
>   initialization image directly after that struct, without adding extra
>   padding.
> 


yeah i think on aarch64 (which is TLS_ABOVE_TP) musl expects a layout

  -------------------------- ----------- - -
 |  struct pthread | 16     |       tlsvar
  -------------------------- ----------- - -
 ^                 ^         <----->
 self              tp         tls offset
 (aligned)

while it should be

  -------------------------- ----------- - -
 |  struct pthread | 16     |       tlsvar
  -------------------------- ----------- - -
 ^                 ^<-------------->
 self              tp    tls offset
                   (aligned)

i think the constraints for tp are:

- tp must be aligned to 'tls_align'

- tp must be at a small fixed offset from the end
of pthread struct (so asm code can access the dtv)

- tp + off must be usable memory for tls for off >= 16
(this is aarch64 specific)

i'm not yet sure what's the best fix.
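as a rough sketch (with a stand-in struct, and ignoring the alignment of
struct pthread itself, which also matters), a placement satisfying the
first two constraints could look like:

----8<--------------
#include <stdint.h>
#include <stddef.h>

struct pthread_like { struct pthread_like *self; void **dtv; }; /* stand-in */

/* pick tp inside mem: tls_align-aligned, at a fixed offset from the
 * end of the stand-in struct; tls data may then start at tp + 16 */
static unsigned char *place_tp(unsigned char *mem, size_t tls_align)
{
	uintptr_t tp = (uintptr_t)mem + sizeof(struct pthread_like);
	tp += -tp & (tls_align - 1);  /* align tp upward */
	return (unsigned char *)tp;   /* self = tp - sizeof(struct pthread_like) */
}
---->8---------------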

> Hence the code tries to access the TLS variables at the wrong location.
> 
> The following patch fixes the issue, but only if musl is then compiled with
> optimizations off. With optimizations, the compiler emits the *same* code for
> both variants. Also, the patch probably has some unexpected side-effects, too -
> I'm just adding it here as a starting point for further debugging.
> 
> Any help is greatly appreciated :-)
> 
> - Phillip
> 
> ----
> 
> diff --git a/arch/aarch64/pthread_arch.h b/arch/aarch64/pthread_arch.h
> index b2e2d8f..c69f6f1 100644
> --- a/arch/aarch64/pthread_arch.h
> +++ b/arch/aarch64/pthread_arch.h
> @@ -2,10 +2,10 @@ static inline struct pthread *__pthread_self()
>  {
>         char *self;
>         __asm__ __volatile__ ("mrs %0,tpidr_el0" : "=r"(self));
> -       return (void*)(self + 16 - sizeof(struct pthread));
> +       return (void*)(self - sizeof(struct pthread));
>  }
> 
>  #define TLS_ABOVE_TP
> -#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread) - 16)
> +#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread))
> 

this is ok, but wastes 16 bytes after the pthread struct
where there will be no tls data (and i guess the allocated tls
size should be adjusted accordingly, as well as tlsdesc.s).

>  #define MC_PC pc
> diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> index b125eb1..3a3c307 100644
> --- a/src/env/__init_tls.c
> +++ b/src/env/__init_tls.c
> @@ -42,7 +42,7 @@ void *__copy_tls(unsigned char *mem)
> 
>         mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
>         td = (pthread_t)mem;
> -       mem += sizeof(struct pthread);
> +       mem += sizeof(struct pthread) + libc.tls_align;
> 

i don't think this works with the above TP_ADJ, but
yes something has to be done differently here to make
aarch64 happy.

>         for (i=1, p=libc.tls_head; p; i++, p=p->next) {
>                 dtv[i] = mem + p->offset;



* Re: TLS issue on aarch64
  2018-05-25 14:50 ` Szabolcs Nagy
@ 2018-05-25 21:38   ` Szabolcs Nagy
  2018-05-25 22:20   ` Phillip Berndt
  2018-05-27  0:17   ` Rich Felker
  2 siblings, 0 replies; 15+ messages in thread
From: Szabolcs Nagy @ 2018-05-25 21:38 UTC
  To: musl

* Szabolcs Nagy <nsz@port70.net> [2018-05-25 16:50:59 +0200]:
> * Phillip Berndt <phillip.berndt@googlemail.com> [2018-05-25 14:40:14 +0200]:
> > I'm experiencing a TLS-related error with musl on aarch64. This is my
> > test program:
> > 
> > ----8<--------------
> > #include <stdio.h>
> > 
> > __thread int foo = 1234;
> > __thread int bar __attribute__((aligned(0x100))) = 5678;
> > 
> > int main(int argc, char **argv)
> > {
> >     printf("0x%p: %d\n", &foo, foo);
> >     printf("0x%p: %d\n", &bar, bar);
> > 
> >     return 0;
> > }
> > ---->8---------------

verified that this fails on arm and sh4 too;
these targets have a non-trivial TP_ADJ. i suspect
we need a further macro to do arch-specific
tls alignment correctly.



* Re: TLS issue on aarch64
  2018-05-25 14:50 ` Szabolcs Nagy
  2018-05-25 21:38   ` Szabolcs Nagy
@ 2018-05-25 22:20   ` Phillip Berndt
  2018-05-26  0:54     ` Szabolcs Nagy
  2018-05-27  0:17   ` Rich Felker
  2 siblings, 1 reply; 15+ messages in thread
From: Phillip Berndt @ 2018-05-25 22:20 UTC
  To: musl

2018-05-25 16:50 GMT+02:00 Szabolcs Nagy <nsz@port70.net>:
> i think the constraints for tp are:
>
> - tp must be aligned to 'tls_align'
>
> - tp must be at a small fixed offset from the end
> of pthread struct (so asm code can access the dtv)
>
> - tp + off must be usable memory for tls for off >= 16
> (this is aarch64 specific)
>

Hmm... but these constraints do not explain the extra offset of one
alignment I'm seeing in the GCC output, do they? If I compile a program
with a single TLS variable with __attribute__((aligned(n))) that does
nothing but reference and print said variable, I get the following
assembler code from GCC:

For n = 0x1000:

  400194:       d53bd041        mrs     x1, tpidr_el0
  400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
  40019c:       91400421        add     x1, x1, #0x1, lsl #12
  4001a0:       91000021        add     x1, x1, #0x0


For n = 0x100:

  400194:       d53bd041        mrs     x1, tpidr_el0
  400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
  40019c:       91400021        add     x1, x1, #0x0, lsl #12
  4001a0:       91040021        add     x1, x1, #0x100

For n = 0x10:

  400194:       d53bd041        mrs     x1, tpidr_el0
  400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
  40019c:       91400021        add     x1, x1, #0x0, lsl #12
  4001a0:       91004021        add     x1, x1, #0x10

That's how I came up with the mem += libc.tls_align hack in the first place.
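In other words, the rule these dumps suggest is "round 16 up to the
variable's alignment" (my own macro sketch, not from any real header):

----8<--------------
#include <stdint.h>

/* offset of the first variable from tp, as implied by the dumps above:
 * n=0x1000 -> 0x1000, n=0x100 -> 0x100, n=0x10 -> 0x10 */
#define ALIGNUP(x, a) (((uintptr_t)(x) + (a) - 1) & ~((uintptr_t)(a) - 1))
#define FIRST_TLS_OFF(n) ALIGNUP(16, (n))
---->8---------------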

- Phillip



* Re: TLS issue on aarch64
  2018-05-25 22:20   ` Phillip Berndt
@ 2018-05-26  0:54     ` Szabolcs Nagy
  2018-05-26  8:24       ` Phillip Berndt
  2018-05-27  0:34       ` Rich Felker
  0 siblings, 2 replies; 15+ messages in thread
From: Szabolcs Nagy @ 2018-05-26  0:54 UTC
  To: musl

* Phillip Berndt <phillip.berndt@googlemail.com> [2018-05-26 00:20:04 +0200]:
> 2018-05-25 16:50 GMT+02:00 Szabolcs Nagy <nsz@port70.net>:
> > i think the constraints for tp are:
> >
> > - tp must be aligned to 'tls_align'
> >
> > - tp must be at a small fixed offset from the end
> > of pthread struct (so asm code can access the dtv)
> >
> > - tp + off must be usable memory for tls for off >= 16
> > (this is aarch64 specific)
> >
> 
> Hmm.. but these constraints do not explain the extra offset of one
> alignment I'm seeing in the GCC output, do they? If I compile a

tp must be aligned and tp + offset must be aligned too,
but offset >= 16 has to hold.

> program with a single TLS variable with
> __attribute__((aligned(n))) that does nothing but try to reference and
> print said variable, I get the
> following assembler code from GCC:
> 
> For n = 0x1000:
> 
>   400194:       d53bd041        mrs     x1, tpidr_el0
>   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
>   40019c:       91400421        add     x1, x1, #0x1, lsl #12
>   4001a0:       91000021        add     x1, x1, #0x0
> 
> 
> For n = 0x100:
> 
>   400194:       d53bd041        mrs     x1, tpidr_el0
>   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
>   40019c:       91400021        add     x1, x1, #0x0, lsl #12
>   4001a0:       91040021        add     x1, x1, #0x100
> 
> For n = 0x10:
> 
>   400194:       d53bd041        mrs     x1, tpidr_el0
>   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
>   40019c:       91400021        add     x1, x1, #0x0, lsl #12
>   4001a0:       91004021        add     x1, x1, #0x10
> 
> That's how I came up with the mem += libc.tls_align hack in the first place.
> 

indeed you need another alignment there, i came up with the
following fix:

(on mips/ppc i expect it not to change anything: tp is
at a page aligned offset from the end of struct pthread,
so one alignment is enough there, but on aarch64/arm/sh4
this makes a difference, and seems to pass my simple tests)

diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
index 1c5d98a0..8e70024d 100644
--- a/src/env/__init_tls.c
+++ b/src/env/__init_tls.c
@@ -41,9 +41,12 @@ void *__copy_tls(unsigned char *mem)
 #ifdef TLS_ABOVE_TP
 	dtv = (void **)(mem + libc.tls_size) - (libc.tls_cnt + 1);
 
-	mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
+	/* Ensure TP is aligned.  */
+	mem += -(uintptr_t)TP_ADJ(mem) & (libc.tls_align-1);
 	td = (pthread_t)mem;
 	mem += sizeof(struct pthread);
+	/* Ensure TLS is aligned after struct pthread.  */
+	mem += -(uintptr_t)mem & (libc.tls_align-1);
 
 	for (i=1, p=libc.tls_head; p; i++, p=p->next) {
 		dtv[i] = mem + p->offset;



* Re: TLS issue on aarch64
  2018-05-26  0:54     ` Szabolcs Nagy
@ 2018-05-26  8:24       ` Phillip Berndt
  2018-05-27  0:34       ` Rich Felker
  1 sibling, 0 replies; 15+ messages in thread
From: Phillip Berndt @ 2018-05-26  8:24 UTC
  To: musl

2018-05-26 2:54 GMT+02:00 Szabolcs Nagy <nsz@port70.net>:
> indeed you need another alignment there, i came up with the
> following fix:
>
> (on mips/ppc i expect it not to change anything: tp is
> at a page aligned offset from the end of struct pthread,
> so one alignment is enough there, but on aarch64/arm/sh4
> this makes a difference, and seems to pass my simple tests)
>
> diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> index 1c5d98a0..8e70024d 100644
> --- a/src/env/__init_tls.c
> +++ b/src/env/__init_tls.c
> @@ -41,9 +41,12 @@ void *__copy_tls(unsigned char *mem)
>  #ifdef TLS_ABOVE_TP
>         dtv = (void **)(mem + libc.tls_size) - (libc.tls_cnt + 1);
>
> -       mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
> +       /* Ensure TP is aligned.  */
> +       mem += -(uintptr_t)TP_ADJ(mem) & (libc.tls_align-1);
>         td = (pthread_t)mem;
>         mem += sizeof(struct pthread);
> +       /* Ensure TLS is aligned after struct pthread.  */
> +       mem += -(uintptr_t)mem & (libc.tls_align-1);
>
>         for (i=1, p=libc.tls_head; p; i++, p=p->next) {
>                 dtv[i] = mem + p->offset;

I took the demo from my first mail and tested your patch with all alignments
up to the page size: They all seem to work fine. Awesome :)

- Phillip



* Re: TLS issue on aarch64
  2018-05-25 14:50 ` Szabolcs Nagy
  2018-05-25 21:38   ` Szabolcs Nagy
  2018-05-25 22:20   ` Phillip Berndt
@ 2018-05-27  0:17   ` Rich Felker
  2 siblings, 0 replies; 15+ messages in thread
From: Rich Felker @ 2018-05-27  0:17 UTC
  To: musl

On Fri, May 25, 2018 at 04:50:59PM +0200, Szabolcs Nagy wrote:
> 
> yeah i think on aarch64 (which is TLS_ABOVE_TP) musl expects a layout
> 
>   -------------------------- ----------- - -
>  |  struct pthread | 16     |       tlsvar
>   -------------------------- ----------- - -
>  ^                 ^         <----->
>  self              tp         tls offset
>  (aligned)

I don't think this is correct; if so it's a bug. The intent is only
that musl expects the point where TLS begins (the beginning of your
tlsvar area above) to be aligned, tp to point 16 bytes below it, and
self to point sizeof(struct pthread) bytes below that.

> while it should be
> 
>   -------------------------- ----------- - -
>  |  struct pthread | 16     |       tlsvar
>   -------------------------- ----------- - -
>  ^                 ^<-------------->
>  self              tp    tls offset
>                    (aligned)

The "tls offset" described here (including the 16 reserved bytes)
doesn't seem to match the "tls offset" in the sense of offset from the
beginning of the PT_TLS image. That seems to be the core issue here.

> i think the constraints for tp are:
> 
> - tp must be aligned to 'tls_align'

I don't see this as an explicit requirement anywhere, but it falls out
implicitly if the linker assigns offsets beginning at
max(16,tls_align) (and adjusted by that much vs the PT_TLS segment).

> - tp must be at a small fixed offset from the end
> of pthread struct (so asm code can access the dtv)
> 
> - tp + off must be usable memory for tls for off >= 16
> (this is aarch64 specific)

The constraint is not "usable memory" but that it lines up with where
we're pasting the PT_TLS image.

> i'm not yet sure what's the best fix.
> 
> > Hence the code tries to access the TLS variables at the wrong location.
> > 
> > The following patch fixes the issue, but only if musl is then compiled with
> > optimizations off. With optimizations, the compiler emits the *same* code for
> > both variants. Also, the patch probably has some unexpected side-effects, too -
> > I'm just adding it here as a starting point for further debugging.
> > 
> > Any help is greatly appreciated :-)
> > 
> > - Phillip
> > 
> > ----
> > 
> > diff --git a/arch/aarch64/pthread_arch.h b/arch/aarch64/pthread_arch.h
> > index b2e2d8f..c69f6f1 100644
> > --- a/arch/aarch64/pthread_arch.h
> > +++ b/arch/aarch64/pthread_arch.h
> > @@ -2,10 +2,10 @@ static inline struct pthread *__pthread_self()
> >  {
> >         char *self;
> >         __asm__ __volatile__ ("mrs %0,tpidr_el0" : "=r"(self));
> > -       return (void*)(self + 16 - sizeof(struct pthread));
> > +       return (void*)(self - sizeof(struct pthread));
> >  }
> > 
> >  #define TLS_ABOVE_TP
> > -#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread) - 16)
> > +#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread))
> > 
> 
> this is ok, but wastes 16 bytes after the pthread struct
> where there will be no tls data (and i guess the allocated tls
> size should be adjusted accordingly, as well as tlsdesc.s).
> 
> >  #define MC_PC pc
> > diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> > index b125eb1..3a3c307 100644
> > --- a/src/env/__init_tls.c
> > +++ b/src/env/__init_tls.c
> > @@ -42,7 +42,7 @@ void *__copy_tls(unsigned char *mem)
> > 
> >         mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
> >         td = (pthread_t)mem;
> > -       mem += sizeof(struct pthread);
> > +       mem += sizeof(struct pthread) + libc.tls_align;
> > 
> 
> i don't think this works with the above TP_ADJ, but
> yes something has to be done differently here to make
> aarch64 happy.

As noted later in the thread, the issue does not seem to be
aarch64-specific, but rather applies whenever the TLS ABI has some
reserved space between the thread pointer and beginning of TLS
(nonzero offset in TP_ADJ? not exactly since on mips, etc. the offset
is not reserved space but just accounting for ridiculous signed 16-bit
offsets).

Rich



* Re: TLS issue on aarch64
  2018-05-26  0:54     ` Szabolcs Nagy
  2018-05-26  8:24       ` Phillip Berndt
@ 2018-05-27  0:34       ` Rich Felker
  2018-05-28 20:47         ` Szabolcs Nagy
  1 sibling, 1 reply; 15+ messages in thread
From: Rich Felker @ 2018-05-27  0:34 UTC
  To: musl

On Sat, May 26, 2018 at 02:54:16AM +0200, Szabolcs Nagy wrote:
> * Phillip Berndt <phillip.berndt@googlemail.com> [2018-05-26 00:20:04 +0200]:
> > 2018-05-25 16:50 GMT+02:00 Szabolcs Nagy <nsz@port70.net>:
> > > i think the constraints for tp are:
> > >
> > > - tp must be aligned to 'tls_align'
> > >
> > > - tp must be at a small fixed offset from the end
> > > of pthread struct (so asm code can access the dtv)
> > >
> > > - tp + off must be usable memory for tls for off >= 16
> > > (this is aarch64 specific)
> > >
> > 
> > Hmm.. but these constraints do not explain the extra offset of one
> > alignment I'm seeing in the GCC output, do they? If I compile a
> 
> tp must be aligned and tp + offset must be aligned too,
> but offset >= 16 has to hold.
> 
> > program with a single TLS variable with
> > __attribute__((aligned(n))) that does nothing but try to reference and
> > print said variable, I get the
> > following assembler code from GCC:
> > 
> > For n = 0x1000:
> > 
> >   400194:       d53bd041        mrs     x1, tpidr_el0
> >   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
> >   40019c:       91400421        add     x1, x1, #0x1, lsl #12
> >   4001a0:       91000021        add     x1, x1, #0x0
> > 
> > 
> > For n = 0x100:
> > 
> >   400194:       d53bd041        mrs     x1, tpidr_el0
> >   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
> >   40019c:       91400021        add     x1, x1, #0x0, lsl #12
> >   4001a0:       91040021        add     x1, x1, #0x100
> > 
> > For n = 0x10:
> > 
> >   400194:       d53bd041        mrs     x1, tpidr_el0
> >   400198:       b0000040        adrp    x0, 409000 <__subtf3+0xd18>
> >   40019c:       91400021        add     x1, x1, #0x0, lsl #12
> >   4001a0:       91004021        add     x1, x1, #0x10
> > 
> > That's how I came up with the mem += libc.tls_align hack in the first place.
> > 
> 
> indeed you need another alignment there, i came up with the
> following fix:
> 
> (on mips/ppc i expect it not to change anything: tp is
> at a page aligned offset from the end of struct pthread,
> so one alignment is enough there, but on aarch64/arm/sh4
> this makes a difference, and seems to pass my simple tests)
> 
> diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> index 1c5d98a0..8e70024d 100644
> --- a/src/env/__init_tls.c
> +++ b/src/env/__init_tls.c
> @@ -41,9 +41,12 @@ void *__copy_tls(unsigned char *mem)
>  #ifdef TLS_ABOVE_TP
>  	dtv = (void **)(mem + libc.tls_size) - (libc.tls_cnt + 1);
>  
> -	mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
> +	/* Ensure TP is aligned.  */
> +	mem += -(uintptr_t)TP_ADJ(mem) & (libc.tls_align-1);
>  	td = (pthread_t)mem;
>  	mem += sizeof(struct pthread);
> +	/* Ensure TLS is aligned after struct pthread.  */
> +	mem += -(uintptr_t)mem & (libc.tls_align-1);
>  
>  	for (i=1, p=libc.tls_head; p; i++, p=p->next) {
>  		dtv[i] = mem + p->offset;

As written this (or anything using libc.tls_align to adjust offset of
the TLS from the TP) is not valid. The value of libc.tls_align is
runtime-variable and will increase upon dlopen, and even without
dlopen, will be non-deterministic dependent on shared libraries from
DT_NEEDED in dynamic-linked programs. The offset between TP and TLS is
a property of the linker's handling of local-exec TLS in the main
program only, and thus probably should be using libc.tls_head->align.

However, care needs to be taken that libc.tls_head may initially be
null if the main program has no TLS, but could later become non-null
due to dlopen. If the offset between TP and TLS changed due to this,
any initial-exec-model TLS access would be wrong. Fortunately such a
program cannot have initial-exec-model accesses (initial-exec is only
valid for TLS that existed at program start), so we can probably just
ignore the issue and always use libc.tls_head ? libc.tls_head->align : 1;
this will cause gratuitous padding for threads created after dlopen of
a library with larger alignment, but should otherwise not hurt
anything.
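
To illustrate the suggested rule (a sketch only; the stand-in declarations
are assumed to mirror the internal ones):

----8<--------------
#include <stdint.h>
#include <stddef.h>

/* stand-ins for the internal declarations */
struct tls_module { struct tls_module *next; void *image;
                    size_t len, size, align, offset; };
struct libc_like { struct tls_module *tls_head; };

/* pad mem for the main program's TLS alignment only; fall back to 1
 * when there was no TLS at startup */
static unsigned char *pad_main_tls(unsigned char *mem, const struct libc_like *lc)
{
	size_t a = lc->tls_head ? lc->tls_head->align : 1;
	return mem + (-(uintptr_t)mem & (a - 1));
}
---->8---------------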

Rich



* Re: TLS issue on aarch64
  2018-05-27  0:34       ` Rich Felker
@ 2018-05-28 20:47         ` Szabolcs Nagy
  2018-05-28 22:15           ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Szabolcs Nagy @ 2018-05-28 20:47 UTC
  To: musl

* Rich Felker <dalias@libc.org> [2018-05-26 20:34:30 -0400]:
> On Sat, May 26, 2018 at 02:54:16AM +0200, Szabolcs Nagy wrote:
> > (on mips/ppc i expect it not to change anything: tp is
> > at a page aligned offset from the end of struct pthread,
> > so one alignment is enough there, but on aarch64/arm/sh4
> > this makes a difference, and seems to pass my simple tests)
> > 
> > diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> > index 1c5d98a0..8e70024d 100644
> > --- a/src/env/__init_tls.c
> > +++ b/src/env/__init_tls.c
> > @@ -41,9 +41,12 @@ void *__copy_tls(unsigned char *mem)
> >  #ifdef TLS_ABOVE_TP
> >  	dtv = (void **)(mem + libc.tls_size) - (libc.tls_cnt + 1);
> >  
> > -	mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
> > +	/* Ensure TP is aligned.  */
> > +	mem += -(uintptr_t)TP_ADJ(mem) & (libc.tls_align-1);
> >  	td = (pthread_t)mem;
> >  	mem += sizeof(struct pthread);
> > +	/* Ensure TLS is aligned after struct pthread.  */
> > +	mem += -(uintptr_t)mem & (libc.tls_align-1);
> >  
> >  	for (i=1, p=libc.tls_head; p; i++, p=p->next) {
> >  		dtv[i] = mem + p->offset;
> 
> As written this (or anything using libc.tls_align to adjust offset of
> the TLS from the TP) is not valid. The value of libc.tls_align is
> runtime-variable and will increase upon dlopen, and even without
> dlopen, will be non-deterministic dependent on shared libraries from
> DT_NEEDED in dynamic-linked programs. The offset between TP and TLS is
> a property of the linker's handling of local-exec TLS in the main
> program only, and thus probably should be using libc.tls_head->align.
> 

ok, makes sense.

> However, care needs to be taken that libc.tls_head may initially be
> null if the main program has no TLS, but could later become non-null
> due to dlopen. If the offset between TP and TLS changed due to this,
> any initial-exec-model TLS access would be wrong. Fortunately such a
> program cannot have initial-exec-model accesses (initial-exec is only
> valid for TLS that existed at program start), so we can probably just
> ignore the issue and always use libc.tls_head ? libc.tls_head->align : 1;
> this will cause gratuitous padding for threads created after dlopen of
> a library with larger alignment, but should otherwise not hurt
> anything.
> 

yes i think we only need to consider the tls alignment requirements
of the main executable, if libc.tls_head can only be changed by
loading libs with initial-exec tls that should be fine.

another issue with the patch is that if tp is aligned then pthread_t
may not get aligned:

tp == self + sizeof(pthread_t) - reserved

so sizeof(pthread_t) - reserved must be divisible by
gcd(alignment of tp, alignof(pthread_t)) to be able to make both
self and tp aligned.

this is not an issue on current targets with current pthread_t,
but we may want to decouple internal struct pthread alignment
details from the abi reserved tls size, i.e. TP_ADJ could be like

tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))

or we add a static assert that reserved and alignof(pthread_t)
are not conflicting.
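
the static assert could be as simple as (sketch with a stand-in struct;
16 is the aarch64 reserved size):

----8<--------------
#include <stddef.h>

struct pthread_like { void *self; void **dtv; long tid; }; /* stand-in */

/* with tp == self + sizeof(struct pthread) - 16, self stays aligned
 * whenever tp is, only if the reserved gap is a multiple of the
 * struct's alignment */
_Static_assert(16 % _Alignof(struct pthread_like) == 0,
               "reserved tls space conflicts with struct pthread alignment");
---->8---------------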



* Re: TLS issue on aarch64
  2018-05-28 20:47         ` Szabolcs Nagy
@ 2018-05-28 22:15           ` Rich Felker
  2018-05-29  6:33             ` Szabolcs Nagy
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2018-05-28 22:15 UTC
  To: musl

On Mon, May 28, 2018 at 10:47:31PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2018-05-26 20:34:30 -0400]:
> > On Sat, May 26, 2018 at 02:54:16AM +0200, Szabolcs Nagy wrote:
> > > (on mips/ppc i expect it not to change anything: tp is
> > > at a page aligned offset from the end of struct pthread,
> > > so one alignment is enough there, but on aarch64/arm/sh4
> > > this makes a difference, and seems to pass my simple tests)
> > > 
> > > diff --git a/src/env/__init_tls.c b/src/env/__init_tls.c
> > > index 1c5d98a0..8e70024d 100644
> > > --- a/src/env/__init_tls.c
> > > +++ b/src/env/__init_tls.c
> > > @@ -41,9 +41,12 @@ void *__copy_tls(unsigned char *mem)
> > >  #ifdef TLS_ABOVE_TP
> > >  	dtv = (void **)(mem + libc.tls_size) - (libc.tls_cnt + 1);
> > >  
> > > -	mem += -((uintptr_t)mem + sizeof(struct pthread)) & (libc.tls_align-1);
> > > +	/* Ensure TP is aligned.  */
> > > +	mem += -(uintptr_t)TP_ADJ(mem) & (libc.tls_align-1);
> > >  	td = (pthread_t)mem;
> > >  	mem += sizeof(struct pthread);
> > > +	/* Ensure TLS is aligned after struct pthread.  */
> > > +	mem += -(uintptr_t)mem & (libc.tls_align-1);
> > >  
> > >  	for (i=1, p=libc.tls_head; p; i++, p=p->next) {
> > >  		dtv[i] = mem + p->offset;
> > 
> > As written this (or anything using libc.tls_align to adjust offset of
> > the TLS from the TP) is not valid. The value of libc.tls_align is
> > runtime-variable and will increase upon dlopen, and even without
> > dlopen, will be non-deterministic dependent on shared libraries from
> > DT_NEEDED in dynamic-linked programs. The offset between TP and TLS is
> > a property of the linker's handling of local-exec TLS in the main
> > program only, and thus probably should be using libc.tls_head.align.
> > 
> 
> ok, makes sense.
> 
> > However, care needs to be taken that libc.tls_head may initially be
> > null if the main program has no TLS, but could later become non-null
> > due to dlopen. If the offset between TP and TLS changed due to this,
> > any initial-exec-model TLS access would be wrong. Fortunately such a
> > program cannot have initial-exec-model accesses (initial-exec is only
> > valid for TLS that existed at program start), so we can probably just
> > ignore the issue and always use libc.tls_head?libc.tls_head.align:1;
> > this will cause gratuitous padding for threads created after dlopen of
> > a library with larger alignment, but should otherwise not hurt
> > anything.
> > 
> 
> yes i think we only need to consider the tls alignment requirements
> of the main executable, if libc.tls_head can only be changed by
> loading libs with initial-exec tls that should be fine.

The only transition possible for libc.tls_head is from null to
non-null, and it's only initially null if there is initially no TLS,
in which case the initial-exec (and local-exec) models are invalid.

> another issue with the patch is that if tp is aligned then pthread_t
> may not get aligned:
> 
> tp == self + sizeof(pthread_t) - reserved

This expression does not look correct; it would have tp point
somewhere inside of the pthread structure rather than just past the
end of it.

Maybe your code is doing this to avoid wasted padding, but if so I
think that's a bad idea. It violates the invariant that the pthread
structure is at a constant offset from tp, which is needed for
efficient pthread_self and for access to fixed slots at the end of it
(like canary or dtv copy).

> so sizeof(pthread_t) - reserved must be divisible by
> gcd(alignment of tp, alignof(pthread_t)) to be able to make both
> self and tp aligned.
> 
> this is not an issue on current targets with current pthread_t,
> but we may want to decouple internal struct pthread alignment
> details from the abi reserved tls size, i.e. TP_ADJ could be like
> 
> tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))
> 
> or we add a static assert that reserved and alignof(pthread_t)
> are not conflicting.

Maybe I'm misunderstanding what "reserved" is, since you're talking
about a static assert...?

Rich



* Re: TLS issue on aarch64
  2018-05-28 22:15           ` Rich Felker
@ 2018-05-29  6:33             ` Szabolcs Nagy
  2018-05-31 15:22               ` Phillip Berndt
  2018-06-01  0:11               ` Szabolcs Nagy
  0 siblings, 2 replies; 15+ messages in thread
From: Szabolcs Nagy @ 2018-05-29  6:33 UTC
  To: musl

* Rich Felker <dalias@libc.org> [2018-05-28 18:15:21 -0400]:
> On Mon, May 28, 2018 at 10:47:31PM +0200, Szabolcs Nagy wrote:
> > another issue with the patch is that if tp is aligned then pthread_t
> > may not get aligned:
> > 
> > tp == self + sizeof(pthread_t) - reserved
> 
> This expression does not look correct; it would have tp point
> somewhere inside of the pthread structure rather than just past the
> end of it.
> 

this is the current tp setup on
aarch64, arm and sh4, see TP_ADJ

we can just make tp = self+sizeof(pthread_t)
but then there will be a small unused hole

> Maybe your code is doing this to avoid wasted padding, but if so I
> think that's a bad idea. It violates the invariant that the pthread
> structure is at a constant offset from tp, which is needed for
> efficient pthread_self and for access to fixed slots at the end of it
> (like canary or dtv copy).
> 
> > so sizeof(pthread_t) - reserved must be divisible by
> > gcd(alignment of tp, alignof(pthread_t)) to be able to make both
> > self and tp aligned.
> > 
> > this is not an issue on current targets with current pthread_t,
> > but we may want to decouple internal struct pthread alignment
> > details from the abi reserved tls size, i.e. TP_ADJ could be like
> > 
> > tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))
> > 
> > or we add a static assert that reserved and alignof(pthread_t)
> > are not conflicting.
> 
> Maybe I'm misunderstanding what "reserved" is, since you're talking
> about a static assert...?
> 

it is the abi reserved space after tp
(the bfd linker calls it TCB_SIZE)



* Re: TLS issue on aarch64
  2018-05-29  6:33             ` Szabolcs Nagy
@ 2018-05-31 15:22               ` Phillip Berndt
  2018-06-01  0:11               ` Szabolcs Nagy
  1 sibling, 0 replies; 15+ messages in thread
From: Phillip Berndt @ 2018-05-31 15:22 UTC
  To: musl

Regarding the interaction with dynamic TLS:

I can't get dynamically linked programs to work at all with musl 1.1.19.
The third stage of the dynamic linker initializes the thread pointer
with the built-in TLS before doing anything else. But again, the pointer
doesn't end up where it's supposed to:

(gdb) p &builtin_tls.pt
$2 = (struct __pthread *) 0x5555642828 <builtin_tls+8>

(gdb) p __pthread_self()
$5 = (struct __pthread *) 0x5555642810 <buf+16>

Since this is the wrong address, the pthread structure isn't actually
zero-initialized; for me that means the cancel state is non-zero, so
the linker exits at the first syscall (because __syscall_cp_c() does
that for canceled threads).

I've got a musl 1.1.18 dynamic linker flying around that at least doesn't
fail to start programs. Your discussion of dynamically linked objects got
me interested, so I ran a program that uses TLS in the main program, TLS
from a statically linked library, TLS from a dynamically linked library,
and TLS from a shared library opened by dlopen(), all at the same time.
Here's the result:

[           main] 0x0x415ff0: 0
[      SO_dlopen] 0x0x415ff0: 0
[      SO_linked] 0x0x415ff0: 0
[         static] 0x0x416ff0: 0

Note that all four variables have differently named symbols; they should
not end up at the same location. All should be page-aligned and have
non-zero values.
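
For reference, the main program of that test looks roughly like this (a
sketch; the library symbols are made-up names, not the literal test):

----8<--------------
#include <stdio.h>
#include <dlfcn.h>

void print_static_lib_tls(void); /* from the statically linked library */
void print_shared_lib_tls(void); /* from the DT_NEEDED shared library */

__thread int main_var __attribute__((aligned(0x1000))) = 42;

int main(void)
{
	printf("[           main] %p: %d\n", (void *)&main_var, main_var);
	print_static_lib_tls();
	print_shared_lib_tls();
	void *h = dlopen("./libdltls.so", RTLD_NOW);
	if (h) {
		void (*f)(void) = (void (*)(void))dlsym(h, "print_dl_tls");
		if (f) f();
	}
	return 0;
}
---->8---------------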

From what I can see, Szabolcs's patch still does the right thing for
statically linked applications. Those can't use dlopen() anyway.

Would it hence be an option to proceed with his patch, and fix dynamically
linked programs in the next step?

- Phillip



* Re: TLS issue on aarch64
  2018-05-29  6:33             ` Szabolcs Nagy
  2018-05-31 15:22               ` Phillip Berndt
@ 2018-06-01  0:11               ` Szabolcs Nagy
  2018-06-01  0:52                 ` Rich Felker
  1 sibling, 1 reply; 15+ messages in thread
From: Szabolcs Nagy @ 2018-06-01  0:11 UTC
  To: musl

* Szabolcs Nagy <nsz@port70.net> [2018-05-29 08:33:17 +0200]:
> * Rich Felker <dalias@libc.org> [2018-05-28 18:15:21 -0400]:
> > On Mon, May 28, 2018 at 10:47:31PM +0200, Szabolcs Nagy wrote:
> > > another issue with the patch is that if tp is aligned then pthread_t
> > > may not get aligned:
> > > 
> > > tp == self + sizeof(pthread_t) - reserved
> > 
> > This expression does not look correct; it would have tp point
> > somewhere inside of the pthread structure rather than just past the
> > end of it.
> > 
> 
> this is the current tp setup on
> aarch64, arm and sh4, see TP_ADJ
> 
> we can just make tp = self+sizeof(pthread_t)
> but then there will be a small unused hole
> 
> > Maybe your code is doing this to avoid wasted padding, but if so I
> > think that's a bad idea. It violates the invariant that the pthread
> > structure is at a constant offset from tp, which is needed for
> > efficient pthread_self and for access to fixed slots at the end of it
> > (like canary or dtv copy).
> > 
> > > so sizeof(pthread_t) - reserved must be divisible by
> > > gcd(alignment of tp, alignof(pthread_t)) to be able to make both
> > > self and tp aligned.
> > > 
> > > this is not an issue on current targets with current pthread_t,
> > > but we may want to decouple internal struct pthread alignment
> > > details from the abi reserved tls size, i.e. TP_ADJ could be like
> > > 
> > > tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))
> > > 
> > > or we add a static assert that reserved and alignof(pthread_t)
> > > are not conflicting.
> > 
> > Maybe I'm misunderstanding what "reserved" is, since you're talking
> > about a static assert...?
> > 
> 
> it is the abi reserved space after tp
> (the bfd linker calls it TCB_SIZE)

did some digging into the bfd linker code; here is a dump of my understanding:

tls variant 1 has two sub-variants: ppc (mips is the same) and
aarch64 (arm and sh are the same, just s/16/8/ for the reserved space)

base: tls segment base address in the elf obj
align: tls segment alignment requirement in the elf obj
addr: address of a tls var in the tls segment of the elf obj
tp: thread pointer at runtime
ptr: address of tls var at runtime

local-exec (off: offset from tp in the code):
	code: ptr = tp + off

	ppc: off = addr - (base + 0x7000)
	aarch64: off = addr - (base - alignup(16, align))

initial-exec (got,add: REL_TPOFF got entry and value/addend):
	code: ptr = tp + *got

	ppc: add = addr - base
	aarch64: add = addr - base

	(*got is libc internal, but it should be set up like
	*got = module_tls_base_ptr - tp + add)

local-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value, off: offset in the code):
	code: ptr = __tls_get_addr(got) + off

	ppc: add = 0, off = addr - (base + 0x8000)
	aarch64: add = addr - base, off = 0

	(__tls_get_addr(got,modid) = dtv[modid] + *got + off +..., libc internal,
	e.g. dtv[modid] could point at a fixed offset from the allocated tls
	data for the module)

general-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value):
	code: ptr = __tls_get_addr(got) (= dtv[modid] + *got)

	ppc: add = addr - base
	aarch64: add = addr - base

ptr has correct alignment if the tls segment is mapped to an aligned address,
i.e. ptr - (addr - base) must be aligned. working backwards from this,
the requirement is:

for local-exec to work:
	ppc: tp - 0x7000 must be aligned
	aarch64: tp + alignup(16, align) must be aligned == tp must be aligned

for initial-exec to work:
	tp + *got - add must be aligned (i.e. *got has to be set up to meet
	the alignment requirement of the module, this does not seem to require
	realignment of tp so even runtime loading of initial-exec tls should
	be possible assuming there is enough space etc...)

for local-dynamic to work: dtv[modid] must be aligned

does this make sense?
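
as a worked check of the aarch64 local-exec arithmetic above (all numbers
made up):

----8<--------------
#include <stdint.h>
#include <stdio.h>

static uintptr_t alignup(uintptr_t n, uintptr_t a) { return (n + a - 1) & -a; }

int main(void)
{
	uintptr_t base = 0, align = 0x100, addr = 0; /* var at segment start */
	uintptr_t tp = 0x10000;                      /* aligned thread pointer */
	uintptr_t off = addr - (base - alignup(16, align)); /* local-exec rule */
	/* ptr = tp + off = tp + 0x100, aligned exactly because tp is */
	printf("off=%#lx ptr=%#lx\n", (unsigned long)off,
	       (unsigned long)(tp + off));
	return 0;
}
---->8---------------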

i can come up with a new patch, but it's not clear if on aarch64 we want
to allow tp to point inside pthread_t or strictly at its end (in which case
the 16 reserved bytes are unused). note that in glibc the 16 bytes after tp
contain the dtv and an unused pointer, and additional magic tls data the
compiler needs to access (e.g. the stack end for split-stack support, or
the stack canary) is expected to come before tls (so the layout is
pthread_t, some magic tls data, tp, 16 bytes with dtv, tls). for us only
the dtv needs to be accessible at a fixed offset from tp (and even that's
not necessary if we generate the load offset in the tlsdesc asm based on
sizeof(pthread_t) etc)

i haven't looked into how gdb tries to handle tls (without libthread_db);
there might be further abi requirements for gdb to work well.



* Re: TLS issue on aarch64
  2018-06-01  0:11               ` Szabolcs Nagy
@ 2018-06-01  0:52                 ` Rich Felker
  2018-06-01  9:38                   ` Szabolcs Nagy
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2018-06-01  0:52 UTC
  To: musl

On Fri, Jun 01, 2018 at 02:11:02AM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2018-05-29 08:33:17 +0200]:
> > * Rich Felker <dalias@libc.org> [2018-05-28 18:15:21 -0400]:
> > > On Mon, May 28, 2018 at 10:47:31PM +0200, Szabolcs Nagy wrote:
> > > > another issue with the patch is that if tp is aligned then pthread_t
> > > > may not get aligned:
> > > > 
> > > > tp == self + sizeof(pthread_t) - reserved
> > > 
> > > This expression does not look correct; it would have tp point
> > > somewhere inside of the pthread structure rather than just past the
> > > end of it.
> > > 
> > 
> > this is the current tp setup on
> > aarch64, arm and sh4, see TP_ADJ
> > 
> > we can just make tp = self+sizeof(pthread_t)
> > but then there will be a small unused hole
> > 
> > > Maybe your code is doing this to avoid wasted padding, but if so I
> > > think that's a bad idea. It violates the invariant that the pthread
> > > structure is at a constant offset from tp, which is needed for
> > > efficient pthread_self and for access to fixed slots at the end of it
> > > (like canary or dtv copy).
> > > 
> > > > so sizeof(pthread_t) - reserved must be divisible by
> > > > gcd(alignment of tp, alignof(pthread_t)) to be able to make both
> > > > self and tp aligned.
> > > > 
> > > > this is not an issue on current targets with current pthread_t,
> > > > but we may want to decouple internal struct pthread alignment
> > > > details from the abi reserved tls size, i.e. TP_ADJ could be like
> > > > 
> > > > tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))
> > > > 
> > > > or we add a static assert that reserved and alignof(pthread_t)
> > > > are not conflicting.
> > > 
> > > Maybe I'm misunderstanding what "reserved" is, since you're talking
> > > about a static assert...?
> > > 
> > 
> > it is the abi reserved space after tp
> > (the bfd linker calls it TCB_SIZE)
> 
> did some digging into the bfd linker code; here is a dump of my understanding:
> 
> tls variant 1 has two sub-variants: ppc (mips is the same) and
> aarch64 (arm and sh are the same, just s/16/8/ for the reserved space)

This is very helpful.

> base: tls segment base address in the elf obj
> align: tls segment alignment requirement in the elf obj
> addr: address of a tls var in the tls segment of the elf obj
> tp: thread pointer at runtime
> ptr: address of tls var at runtime
> 
> local-exec (off: offset from tp in the code):
> 	code: ptr = tp + off
> 
> 	ppc: off = addr - (base + 0x7000)
> 	aarch64: off = addr - (base - alignup(16, align))
> 
> initial-exec (got,add: REL_TPOFF got entry and value/addend):
> [...]
> local-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value, off: offset in the code):
> [...]
> general-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value):
> [...]

None but local-exec are really linker ABI, except kinda in the case of
static linking if the linker fails to perform relaxations. A more
precise way to say this would be that only TLS which is accessible via
local-exec model has any constraints imposed on it by the linker as to
where it lies relative to TP. For initial-exec, the address relative
to TP must be constant across all threads, but it's not set by the
linker (except in the case of failed relaxation, in which case it's
just the local-exec model with the offset stored out-of-band). For
other models, there are no constraints at all; it's entirely up to the
dynamic linker implementation how it arranges for things to work.

> ptr has correct alignment if the tls segment is mapped to an aligned address,
> i.e. ptr - (addr - base) must be aligned. working backwards from this,
> the requirement is:
> 
> for local-exec to work:
> 	ppc: tp - 0x7000 must be aligned

This should be fine, since musl arranges for self+sizeof(struct
pthread) to be aligned, and that's exactly what tp-0x7000 is.

> 	aarch64: tp + alignup(16, align) must be aligned == tp must be aligned

OK, I see two possible solutions here:

1. tp==self+sizeof(struct pthread). In this case we'll waste some
space (vs the current approach) when no extra alignment is needed, but
it's simple and clean because all the alignments match up naturally.

2. tp==self+sizeof(struct pthread)-16 (or rather -reserved in
general). This preserves the current memory usage, but requires
complex new alignment logic since self will no longer be aligned mod
tls_align when tls_align>reserved.

I pretty strongly prefer option 1.

In either case, the main_tls.offset/app.tls.offset value needs to
correctly reflect the offset of the TLS from TP, so it either needs to
be alignup(reserved,tls_align) or alignup(reserved,tls_align)-reserved
depending on option 1 or 2. After that change is made, we need to make
sure the storage needs (libc.tls_size) are computed correctly and
account for the extra space due to the initial positive offset.

No change is then needed in __copy_tls.

Changes to TP_ADJ and __pthread_self are needed to get reserved out of
them, and the value of reserved needs to be provided somewhere else
for computing main_tls.offset.
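
A sketch of what option 1 could look like for aarch64, in the style of the
patches above (GAP_ABOVE_TP is an illustrative name, nothing settled, and
struct pthread is musl's internal one):

----8<--------------
/* arch/aarch64/pthread_arch.h, option-1 style (sketch) */
static inline struct pthread *__pthread_self()
{
	char *self;
	__asm__ __volatile__ ("mrs %0,tpidr_el0" : "=r"(self));
	return (void *)(self - sizeof(struct pthread));
}

#define TLS_ABOVE_TP
#define GAP_ABOVE_TP 16 /* ABI-reserved space past tp */
#define TP_ADJ(p) ((char *)(p) + sizeof(struct pthread))

/* elsewhere: main_tls.offset = alignup(GAP_ABOVE_TP, main_tls.align),
 * and libc.tls_size must include that gap */
---->8---------------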

> for initial-exec to work:
> 	tp + *got - add must be aligned (i.e. *got has to be set up to meet
> 	the alignment requirement of the module, this does not seem to require
> 	realignment of tp so even runtime loading of initial-exec tls should
> 	be possible assuming there is enough space etc...)

There's never space so it's not even a question, but even if there
were, no, it can't be done because tp will not be aligned mod some
possibly-larger alignment than the alignment in effect at the time the
thread was created.

> i can come up with a new patch, but it's not clear if on aarch64 we want
> to allow tp to point inside pthread_t or strictly at its end (in which case
> the 16 reserved bytes are unused). note that in glibc the 16 bytes after tp
> contain the dtv and an unused pointer, and additional magic tls data the
> compiler needs to access (e.g. the stack end for split-stack support, or
> the stack canary) is expected to come before tls (so the layout is
> pthread_t, some magic tls data, tp, 16 bytes with dtv, tls). for us only
> the dtv needs to be accessible at a fixed offset from tp (and even that's
> not necessary if we generate the load offset in the tlsdesc asm based on
> sizeof(pthread_t) etc)
> 
> i haven't looked into how gdb tries to handle tls (without libthread_db);
> there might be further abi requirements for gdb to work well.

I don't think gdb is working reasonably at all right now for accessing
TLS on musl. It really should just be enhanced to inject calls to
__tls_get_addr and/or dlsym.

Rich



* Re: TLS issue on aarch64
  2018-06-01  0:52                 ` Rich Felker
@ 2018-06-01  9:38                   ` Szabolcs Nagy
  0 siblings, 0 replies; 15+ messages in thread
From: Szabolcs Nagy @ 2018-06-01  9:38 UTC
  To: musl

* Rich Felker <dalias@libc.org> [2018-05-31 20:52:00 -0400]:
> On Fri, Jun 01, 2018 at 02:11:02AM +0200, Szabolcs Nagy wrote:
> > 	aarch64: tp + alignup(16, align) must be aligned == tp must be aligned
> 
> OK, I see two possible solutions here:
> 
> 1. tp==self+sizeof(struct pthread). In this case we'll waste some
> space (vs the current approach) when no extra alignment is needed, but
> it's simple and clean because all the alignments match up naturally.
> 
> 2. tp==self+sizeof(struct pthread)-16 (or rather -reserved in
> general). This preserves the current memory usage, but requires
> complex new alignment logic since self will no longer be aligned mod
> tls_align when tls_align>reserved.
> 
> I pretty strongly prefer option 1.
> 

ok.

> In either case, the main_tls.offset/app.tls.offset value needs to
> correctly reflect the offset of the TLS from TP, so it either needs to
> be alignup(reserved,tls_align) or alignup(reserved,tls_align)-reserved
> depending on option 1 or 2. After that change is made, we need to make
> sure the storage needs (libc.tls_size) are computed correctly and
> account for the extra space due to the initial positive offset.
> 
> No change is then needed in __copy_tls.
> 
> Changes to TP_ADJ and __pthread_self are needed to get reserved out of
> them, and the value of reserved needs to be provided somewhere else
> for computing main_tls.offset.
> 

ok.
i'll try to prepare a patch.

> > for initial-exec to work:
> > 	tp + *got - add must be aligned (i.e. *got has to be set up to meet
> > 	the alignment requirement of the module, this does not seem to require
> > 	realignment of tp so even runtime loading of initial-exec tls should
> > 	be possible assuming there is enough space etc...)
> 
> There's never space so it's not even a question, but even if there
> were, no, it can't be done because tp will not be aligned mod some
> possibly-larger alignment than the alignment in effect at the time the
> thread was created.
> 

ah right because there are many threads with different tp
so tp + *got can only be aligned if tp modulo alignment
is the same in all of them.

