mailing list of musl libc
* simplification of __aeabi_read_tp
@ 2017-08-31 22:26 Jörg Mische
  2017-08-31 22:52 ` Rich Felker
  0 siblings, 1 reply; 2+ messages in thread
From: Jörg Mische @ 2017-08-31 22:26 UTC (permalink / raw)
  To: musl

Hi,

I am trying to adapt the ARM assembler parts to ARMv6-M (the Thumb-only
architecture of the Cortex-M0) without breaking ARMv4T compatibility.
One issue is the function __aeabi_read_tp(), which must not clobber any
register other than r0, which holds the return value.

Register-saving code that avoids both "pop {lr}" (which is not supported
by ARMv6-M) and "pop {pc}" (which does not interwork on ARMv4T) is very
ugly, so I took a closer look at the function's internals and found the
following:

__aeabi_read_tp() calls __aeabi_read_tp_c() which inlines the function 
__pthread_self(). With ARMv7 and above, __pthread_self() simply reads 
the coprocessor register c13 without clobbering any registers. Below 
ARMv7, the function pointer __a_gettp_ptr is called. __a_gettp_ptr 
either points to __a_gettp_cp15() (a routine that reads c13) or to the 
kuser_get_tls function provided by the kernel.
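
The pre-ARMv7 dispatch described above can be modelled on any host as a call through a function pointer. This is only a sketch: every name ending in _model, the fake_tp value, the call counters and the has_tls_register flag are hypothetical stand-ins, and how the real pointer gets selected is not covered in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the thread pointer value (made-up address). */
static uintptr_t fake_tp = 0x20001000u;
static int cp15_calls, kuser_calls;

/* Models __a_gettp_cp15(): on hardware, a single read of c13. */
static uintptr_t gettp_cp15_model(void)  { cp15_calls++;  return fake_tp; }

/* Models the kernel-provided kuser_get_tls helper. */
static uintptr_t gettp_kuser_model(void) { kuser_calls++; return fake_tp; }

/* Models __a_gettp_ptr: one pointer, two possible targets. */
static uintptr_t (*gettp_ptr_model)(void) = gettp_kuser_model;

/* Assumption: some startup code picks the target once. */
static void select_gettp(int has_tls_register)
{
	gettp_ptr_model = has_tls_register ? gettp_cp15_model
	                                   : gettp_kuser_model;
}

/* Every thread-pointer lookup is an indirect call through the pointer. */
static uintptr_t read_tp_model(void)
{
	return gettp_ptr_model();
}

/* Checks that both targets are reachable and return the same value. */
static void self_check_dispatch(void)
{
	cp15_calls = kuser_calls = 0;
	select_gettp(1);
	assert(read_tp_model() == fake_tp);
	select_gettp(0);
	assert(read_tp_model() == fake_tp);
	assert(cp15_calls == 1 && kuser_calls == 1);
}
```

Calling self_check_dispatch() from a test main() exercises both paths. The model deliberately ignores register clobbering, which is the actual concern of this thread.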

The interesting point is that neither __a_gettp_cp15() (only one
instruction and a return) nor kuser_get_tls (according to the kernel
spec) clobbers any registers. The only reason for saving the registers
is the indirection via the C function __aeabi_read_tp_c(), where the
compiler is allowed to clobber r0-r3.

Since inline functions cannot be called from assembler and any C code 
must be avoided, I rewrote the code of __pthread_self() directly in 
assembler in __aeabi_read_tp.S. With these modifications the binary code 
is not only faster, it also works on the Cortex-M0 processor.
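
The assembler version can return the raw value without any address arithmetic because the adjustment in the removed __aeabi_read_tp_c() appears to undo the one inside __pthread_self(). The sketch below checks that round trip under an assumption this mail does not state: that __pthread_self() returns the raw register value plus 8 minus sizeof(struct pthread). The struct size and the register value are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pthread; only its size matters here. */
struct pthread_model { char opaque[120]; };

/* ASSUMED behaviour of __pthread_self(): step back from the raw register
 * value to the start of the thread structure. */
static uintptr_t pthread_self_model(uintptr_t tp_reg)
{
	return tp_reg + 8 - sizeof(struct pthread_model);
}

/* The arithmetic from the removed __aeabi_read_tp_c(). */
static uintptr_t aeabi_read_tp_c_model(uintptr_t tp_reg)
{
	return pthread_self_model(tp_reg) - 8 + sizeof(struct pthread_model);
}

/* Under the assumption above, the two adjustments cancel exactly. */
static void self_check_roundtrip(void)
{
	uintptr_t tp_reg = 0x20001000u; /* made-up register value */
	assert(aeabi_read_tp_c_model(tp_reg) == tp_reg);
}
```

If the assumption holds, returning the result of the call through __a_gettp_ptr unchanged is equivalent to the old C path.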

Best regards,
Jörg

---
  src/thread/arm/__aeabi_read_tp.S   | 22 ++++++++++++++++++++++
  src/thread/arm/__aeabi_read_tp.s   |  8 --------
  src/thread/arm/__aeabi_read_tp_c.c |  8 --------
  3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/src/thread/arm/__aeabi_read_tp.S b/src/thread/arm/__aeabi_read_tp.S
new file mode 100644
index 0000000..897b4f8
--- /dev/null
+++ b/src/thread/arm/__aeabi_read_tp.S
@@ -0,0 +1,22 @@
+.syntax unified
+.global __a_gettp_ptr
+.hidden __a_gettp_ptr
+.global __aeabi_read_tp
+.type __aeabi_read_tp,%function
+__aeabi_read_tp:
+
+#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6ZK__) && !__thumb__) || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+
+	mrc p15,0,r0,c13,c0,3
+	bx lr
+
+#else
+
+	ldr r0,2f
+	add r0,r0,pc
+	ldr r0,[r0]
+1:	bx r0
+	.align 2
+2:	.word __a_gettp_ptr-1b
+
+#endif
diff --git a/src/thread/arm/__aeabi_read_tp.s b/src/thread/arm/__aeabi_read_tp.s
deleted file mode 100644
index 9d0cd31..0000000
--- a/src/thread/arm/__aeabi_read_tp.s
+++ /dev/null
@@ -1,8 +0,0 @@
-.syntax unified
-.global __aeabi_read_tp
-.type __aeabi_read_tp,%function
-__aeabi_read_tp:
-	push {r1,r2,r3,lr}
-	bl __aeabi_read_tp_c
-	pop {r1,r2,r3,lr}
-	bx lr
diff --git a/src/thread/arm/__aeabi_read_tp_c.c b/src/thread/arm/__aeabi_read_tp_c.c
deleted file mode 100644
index 654bdc5..0000000
--- a/src/thread/arm/__aeabi_read_tp_c.c
+++ /dev/null
@@ -1,8 +0,0 @@
-#include "pthread_impl.h"
-#include <stdint.h>
-
-__attribute__((__visibility__("hidden")))
-void *__aeabi_read_tp_c(void)
-{
-	return (void *)((uintptr_t)__pthread_self()-8+sizeof(struct pthread));
-}


* Re: simplification of __aeabi_read_tp
  2017-08-31 22:26 simplification of __aeabi_read_tp Jörg Mische
@ 2017-08-31 22:52 ` Rich Felker
  0 siblings, 0 replies; 2+ messages in thread
From: Rich Felker @ 2017-08-31 22:52 UTC (permalink / raw)
  To: musl

On Fri, Sep 01, 2017 at 12:26:05AM +0200, Jörg Mische wrote:
> Hi,
> 
> I am trying to adapt the ARM assembler parts to ARMv6-M (the thumb2
> subset of the Cortex-M0) without breaking ARMv4T compatibility. One
> issue is the function __aeabi_read_tp(), which may not clobber any
> registers except the return value in r0.
> 
> Register saving code that avoids "pop {lr}" (which is not supported
> by ARMv6-M) and "pop {pc}" (which is not supported by ARMv4T) is
> very ugly, therefore I took a closer look at its internals and
> discovered the following:
> 
> __aeabi_read_tp() calls __aeabi_read_tp_c() which inlines the
> function __pthread_self(). With ARMv7 and above, __pthread_self()
> simply reads the coprocessor register c13 without clobbering any
> registers. Below ARMv7, the function pointer __a_gettp_ptr is
> called. __a_gettp_ptr either points to __a_gettp_cp15() (a routine
> that reads c13) or to the kuser_get_tls function provided by the
> kernel.
> 
> The interesting point is that neither __a_gettp_cp15() (only one
> instruction and a return) nor kuser_get_tls (according to the kernel
> spec) clobber any registers. The only reason for saving the
> registers is the indirection via the C-function __aeabi_read_tp_c(),
> where the compiler is allowed to clobber r0-r3.
> 
> Since inline functions cannot be called from assembler and any C
> code must be avoided, I rewrote the code of __pthread_self()
> directly in assembler in __aeabi_read_tp.S. With these modifications
> the binary code is not only faster, it also works on the Cortex-M0
> processor.

If you look at the commit text for commit
29237f7f5c09c436825a7a12b68ab4143b0ebd1f which added the indirection
through C code, one of the goals was to make the arm target
fdpic-ready (to support shareable text on cortex-m).

> diff --git a/src/thread/arm/__aeabi_read_tp.S b/src/thread/arm/__aeabi_read_tp.S
> new file mode 100644
> index 0000000..897b4f8
> --- /dev/null
> +++ b/src/thread/arm/__aeabi_read_tp.S
> @@ -0,0 +1,22 @@
> +.syntax unified
> +.global __a_gettp_ptr
> +.hidden __a_gettp_ptr
> +.global __aeabi_read_tp
> +.type __aeabi_read_tp,%function
> +__aeabi_read_tp:
> +
> +#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6ZK__) && !__thumb__) || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> +
> +	mrc p15,0,r0,c13,c0,3
> +	bx lr
> +
> +#else
> +
> +	ldr r0,2f
> +	add r0,r0,pc
> +	ldr r0,[r0]
> +1:	bx r0
> +	.align 2
> +2:	.word __a_gettp_ptr-1b
> +
> +#endif

Here, __a_gettp_ptr-1b is not position-independent on fdpic, since
__a_gettp_ptr lies in the data segment and 1b lies in the text
segment, whose base addresses can float relative to one another. Of
course we could write an fdpic-specific version of the asm function to
access the hidden GOT pointer argument and do a GOT-relative load, but
the intent here was for the compiler to take care of all the ABI
issues.
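
The problem with a text-to-data offset fixed at link time can be shown with plain address arithmetic. In this sketch all base addresses and offsets are made-up illustration values; only the relationship between them matters.

```c
#include <assert.h>
#include <stdint.h>

/* Where the two segments sit. All values used below are invented. */
struct layout { uintptr_t text_base, data_base; };

#define LABEL_OFF 0x40u /* offset of label 1 inside the text segment */
#define PTR_OFF   0x10u /* offset of __a_gettp_ptr inside the data segment */

/* The constant stored at label 2: __a_gettp_ptr-1b, fixed at link time. */
static uintptr_t link_time_delta(struct layout link)
{
	return (link.data_base + PTR_OFF) - (link.text_base + LABEL_OFF);
}

/* What the asm computes at run time: address of label 1 plus the constant. */
static uintptr_t computed_addr(struct layout run, uintptr_t delta)
{
	return run.text_base + LABEL_OFF + delta;
}

/* Where __a_gettp_ptr really is at run time. */
static uintptr_t actual_addr(struct layout run)
{
	return run.data_base + PTR_OFF;
}

static void self_check_fdpic(void)
{
	struct layout link    = { 0x10000u, 0x30000u };
	struct layout shifted = { 0x50000u, 0x70000u }; /* both moved by 0x40000 */
	struct layout split   = { 0x50000u, 0x90000u }; /* moved independently */
	uintptr_t delta = link_time_delta(link);

	/* Ordinary PIC: text and data move together, the offset stays valid. */
	assert(computed_addr(shifted, delta) == actual_addr(shifted));
	/* fdpic-style loading: the bases float apart, the offset is stale. */
	assert(computed_addr(split, delta) != actual_addr(split));
}
```

With the independently moved layout, the computed address misses the real one by exactly the difference between the two segment displacements, which is why a GOT-relative load would be needed on fdpic.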

OTOH, relying on the compiler not to generate code that could clobber
other normally-call-clobbered registers (the floating point ones) is
probably wrong, so maybe what I did here is just a bad idea and your
approach (with an extra special case for fdpic once it's added) is the
right way.

Rich

