mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Musl on Cortex-M Devices
@ 2020-12-16 23:43 Jesse DeGuire
  2020-12-17  0:23 ` Rich Felker
  0 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2020-12-16 23:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2815 bytes --]

Hey everyone,

I'm working on putting together a Clang-based toolchain to use with
Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
alternative to their paid XC32 toolchain and I'd like to use Musl as
the C library. Currently, I'm trying to get it to build for a few
different Cortex-M devices and have found that Musl builds fine for
ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
instructions not supported on the Thumb ISA subset used by v6-M and
v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
can easily switch to HEAD if needed. I am using a Python script to
build Musl (and eventually the rest of the toolchain), which you can
see on GitHub at the following link. It's a bit of a mess at the
moment, but the build_musl() function is what I'm currently using to
build Musl.

https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py

Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
Baseline and have attached a diff to this email. If you like, I can go
into more detail as to why I made the changes I made; however, many
changes were merely the result of my attempts to correct errors
reported by Clang due to it encountering instruction sequences not
supported on ARMv6-M. A number of errors were simply the result of
ARMv6-M requiring one to use the "S" variant of an instruction that
sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
files I had to change from a "lower case s" to a "capital-S" file so
that I could use macros to check for either the Thumb1 ISA
("__thumb2__ || !__thumb__") or for an M-Profile device
("!__ARM_ARCH_ISA_ARM"). The changes under
"src/thread/arm/__set_thread_area.c" are different in that I made
those because I don't believe Cortex-M devices could handle what was
there (no M-Profile device has Coprocessor 15, for example) and so I
sort of punted for now and figured that a user's fault handler could
handle those "_kuser" calls given that there probably won't be a
kernel to do so.

Unfortunately, I'm not far enough yet in my project to build code that
would actually run on one of the micros I have with me, so I haven't
yet been able to properly try out these changes. Also, I should
mention that my eventual plan is to make a "baremetal" layer that
would provide stubs to handle syscalls or thread-related things that
you wouldn't normally have on a microcontroller, but I want to get
Musl building without that first because I think I'll be able to
utilize a lot of what's already present when I get to that point.
Hopefully, what's here is still somewhat useful as a starting point
should Musl want to support Cortex-M. I'll try to update this diff as
a result of feedback or my own eventual testing (I'm a long way from
that, though!).

Thanks,
Jesse

[-- Attachment #2: musl_cortexm.diff --]
[-- Type: application/octet-stream, Size: 22666 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-16 23:43 [musl] Musl on Cortex-M Devices Jesse DeGuire
@ 2020-12-17  0:23 ` Rich Felker
  2020-12-17  4:55   ` Patrick Oppenlander
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Rich Felker @ 2020-12-17  0:23 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl

On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> Hey everyone,
> 
> I'm working on putting together a Clang-based toolchain to use with
> Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> alternative to their paid XC32 toolchain and I'd like to use Musl as
> the C library. Currently, I'm trying to get it to build for a few
> different Cortex-M devices and have found that Musl builds fine for
> ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> instructions not supported on the Thumb ISA subset used by v6-M and
> v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> can easily switch to HEAD if needed. I am using a Python script to
> build Musl (and eventually the rest of the toolchain), which you can
> see on GitHub at the following link. It's a bit of a mess at the
> moment, but the build_musl() function is what I'm currently using to
> build Musl.

I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
it is, let's have a look.

> https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> 
> Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> Baseline and have attached a diff to this email. If you like, I can go
> into more detail as to why I made the changes I made; however, many
> changes were merely the result of my attempts to correct errors
> reported by Clang due to it encountering instruction sequences not
> supported on ARMv6-M.

Are there places where clang's linker is failing to make substitutions
that the GNU one can do, that would make this simpler? For example I
know the GNU one can replace bx rn by mov pc,rn if needed (based on a
relocation the assembler emits on the insn).

> A number of errors were simply the result of
> ARMv6-M requiring one to use the "S" variant of an instruction that
> sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> files I had to change from a "lower case s" to a "capital-S" file so
> that I could use macros to check for either the Thumb1 ISA
> ("__thumb2__ || !__thumb__") or for an M-Profile device
> ("!__ARM_ARCH_ISA_ARM").

Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
If not this may need an alternate detection. But I'd like to drop as
much as possible and just make the code compatible rather than having
2 versions of it. I don't think there are any places where the
performance or size is at all relevant.

> The changes under
> "src/thread/arm/__set_thread_area.c" are different in that I made
> those because I don't believe Cortex-M devices could handle what was
> there (no M-Profile device has Coprocessor 15, for example) and so I

Unless this is an ISA level that can't be executed on a normal (non-M)
ARM profile, it still needs all the backends that might be needed and
runtime selection of which to use. This is okay. I believe Linux for
nommu ARM has a syscall for get_tp, which is rather awful but probably
needs to be added as a backend. The right way to do this would have
been with trap-and-emulate (of cp15) I think...

> sort of punted for now and figured that a user's fault handler could
> handle those "_kuser" calls given that there probably won't be a
> kernel to do so.

I don't think you can use the kuser_helper versions here because (1)
they're at ARM addresses, not thumb ones, and (2) they're in the
address range that Cortex M uses for special stuff.

> Unfortunately, I'm not far enough yet in my project to build code that
> would actually run on one of the micros I have with me, so I haven't
> yet been able to properly try out these changes. Also, I should
> mention that my eventual plan is to make a "baremetal" layer that
> would provide stubs to handle syscalls or thread-related things that
> you wouldn't normally have on a microcontroller, but I want to get
> Musl building without that first because I think I'll be able to
> utilize a lot of what's already present when I get to that point.
> Hopefully, what's here is still somewhat useful as a starting point
> should Musl want to support Cortex-M. I'll try to update this diff as
> a result of feedback or my own eventual testing (I'm a long way from
> that, though!).

Thanks for the update. Please try to attach as UTF-8 rather than
UTF-16 next time so it's more readable/accessible. I had to guess it
was UTF-16 and switch the mime encoding headers to read it.

Further comments inline below:

> diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> index 9e3937cc..21f102ad 100644
> --- a/arch/arm/atomic_arch.h
> +++ b/arch/arm/atomic_arch.h
> @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
>  
>  #ifndef a_barrier
>  #define a_barrier a_barrier
> +#if __ARM_ARCH_ISA_ARM
>  static inline void a_barrier()
>  {
>  	register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
>  	__asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
>  }
> +#else
> +static inline void a_barrier()
> +{
> +	__asm__ __volatile__ ("dmb" : : : "memory");
> +}
> +#endif
>  #endif

There's a condition above with this code already; you just need to
make it used for an additional case.

>  #define a_crash a_crash
> diff --git a/arch/arm/crt_arch.h b/arch/arm/crt_arch.h
> index 99508b1d..c48b14b0 100644
> --- a/arch/arm/crt_arch.h
> +++ b/arch/arm/crt_arch.h
> @@ -3,13 +3,25 @@ __asm__(
>  ".global " START " \n"
>  ".type " START ",%function \n"
>  START ": \n"
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  "	mov fp, #0 \n"
>  "	mov lr, #0 \n"
> +#else
> +"	movs a3, #0 \n"
> +"	mov fp, a3 \n"
> +"	mov lr, a3 \n"
> +#endif
>  "	ldr a2, 1f \n"
>  "	add a2, pc, a2 \n"
>  "	mov a1, sp \n"
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  "2:	and ip, a1, #-16 \n"
>  "	mov sp, ip \n"
> +#else
> +"2:	subs a3, #16 \n"
> +"	ands a1, a3 \n"
> +"	mov sp, a1 \n"
> +#endif
>  "	bl " START "_c \n"
>  ".weak _DYNAMIC \n"
>  ".hidden _DYNAMIC \n"

This can just use the new code unconditionally.

> diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> index e689ea21..a89fb554 100644
> --- a/arch/arm/pthread_arch.h
> +++ b/arch/arm/pthread_arch.h
> @@ -1,5 +1,5 @@
> -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)

That's probably not the right condition.

>  static inline pthread_t __pthread_self()
>  {
> diff --git a/configure b/configure
> old mode 100755
> new mode 100644
> diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> index dc020f92..83e20da2 100644
> --- a/crt/arm/crtn.S
> +++ b/crt/arm/crtn.S
> @@ -1,9 +1,17 @@
>  .syntax unified
>  
>  .section .init
> +#if __ARM_ARCH_ISA_ARM
>  	pop {r0,lr}
>  	bx lr
> +#else
> +	pop {r0,pc}
> +#endif
>  
>  .section .fini
> +#if __ARM_ARCH_ISA_ARM
>  	pop {r0,lr}
>  	bx lr
> +#else
> +	pop {r0,pc}
> +#endif

Ideally the linker could handle this.

> diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> index 3ae133c9..1843d97d 100644
> --- a/src/ldso/arm/tlsdesc.S
> +++ b/src/ldso/arm/tlsdesc.S
> @@ -12,13 +12,19 @@ __tlsdesc_static:
>  .hidden __tlsdesc_dynamic
>  .type __tlsdesc_dynamic,%function
>  __tlsdesc_dynamic:
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	push {r2,r3,ip,lr}
> +#else
> +	push {r2-r4,lr}
> +	mov r2,ip
> +	push {r2}
> +#endif

AFAICT ip is not special here. If it's a pain to push/pop, just use a
different register.

>  	ldr r1,[r0]
>  	ldr r2,[r1,#4]  // r2 = offset
>  	ldr r1,[r1]     // r1 = modid
>  
> -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
>  	mrc p15,0,r0,c13,c0,3
>  #else
>  	ldr r0,1f
> @@ -36,19 +42,34 @@ __tlsdesc_dynamic:
>  	bx r0
>  #endif
>  #endif
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	ldr r3,[r0,#-4] // r3 = dtv
>  	ldr ip,[r3,r1,LSL #2]
>  	sub r0,ip,r0
> +#else
> +	mov r4,r0
> +	subs r4,#4
> +	ldr r3,[r4]
> +	lsls r4,r1,#2
> +	ldr r4,[r3,r4]
> +	subs r0,r4,r0
> +#endif
>  	add r0,r0,r2    // r0 = r3[r1]-r0+r2
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  #if __ARM_ARCH >= 5
>  	pop {r2,r3,ip,pc}
>  #else
>  	pop {r2,r3,ip,lr}
>  	bx lr
>  #endif
> +#else
> +	pop {r2}
> +	mov ip,r2
> +	pop {r2-r4,pc}
> +#endif
>  
> -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
>  #else
>  	.align 2
>  1:	.word __a_gettp_ptr - 2b
> diff --git a/src/process/arm/vfork.s b/src/process/arm/vfork.s
> index d7ec41b3..b6f0260e 100644
> --- a/src/process/arm/vfork.s
> +++ b/src/process/arm/vfork.s
> @@ -3,7 +3,7 @@
>  .type vfork,%function
>  vfork:
>  	mov ip, r7
> -	mov r7, 190
> +	movs r7, 190
>  	svc 0
>  	mov r7, ip
>  	.hidden __syscall_ret

OK.

> diff --git a/src/setjmp/arm/longjmp.S b/src/setjmp/arm/longjmp.S
> index 8df0b819..745c9ba2 100644
> --- a/src/setjmp/arm/longjmp.S
> +++ b/src/setjmp/arm/longjmp.S
> @@ -5,6 +5,7 @@
>  .type longjmp,%function
>  _longjmp:
>  longjmp:
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	mov ip,r0
>  	movs r0,r1
>  	moveq r0,#1
> @@ -16,7 +17,7 @@ longjmp:
>  	ldr r2,1f
>  	ldr r1,[r1,r2]
>  
> -#if __ARM_ARCH < 8
> +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
>  	tst r1,#0x260
>  	beq 3f
>  	// HWCAP_ARM_FPA
> @@ -31,7 +32,7 @@ longjmp:
>  	.fpu softvfp
>  	.eabi_attribute 10, 0
>  	.eabi_attribute 27, 0
> -#if __ARM_ARCH < 8
> +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
>  	// HWCAP_ARM_IWMMXT
>  2:	tst r1,#0x200
>  	beq 3f
> @@ -42,6 +43,36 @@ longjmp:
>  	ldcl p1, cr14, [ip], #8
>  	ldcl p1, cr15, [ip], #8
>  #endif
> +#else
> +	mov ip,r0
> +	movs r0,r1
> +	bne 4f
> +	movs r0,#1
> +4:	mov r1,ip
> +	adds r1,#16
> +	ldmia r1!, {r2-r7}
> +	mov lr,r7
> +	mov sp,r6
> +	mov r11,r5
> +	mov r10,r4
> +	mov r9,r3
> +	mov r8,r2
> +	mov ip,r1
> +	subs r1,#40
> +	ldmia r1!, {r4-r7}
> +
> +	adr r1,1f
> +	ldr r2,1f
> +	ldr r1,[r1,r2]
> +	movs r2,#0x40
> +	tst r1,r2
> +	beq 2f
> +	.fpu vfp
> +	vldmia ip!, {d8-d15}
> +	.fpu softvfp
> +	.eabi_attribute 10, 0
> +	.eabi_attribute 27, 0
> +#endif
>  2:
>  3:	bx lr

This should probably be unified somehow.

> diff --git a/src/setjmp/arm/setjmp.S b/src/setjmp/arm/setjmp.S
> index 45731d22..440c166d 100644
> --- a/src/setjmp/arm/setjmp.S
> +++ b/src/setjmp/arm/setjmp.S
> @@ -8,6 +8,7 @@
>  __setjmp:
>  _setjmp:
>  setjmp:
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	mov ip,r0
>  	stmia ip!,{v1,v2,v3,v4,v5,v6,sl,fp}
>  	mov r2,sp
> @@ -18,7 +19,7 @@ setjmp:
>  	ldr r2,1f
>  	ldr r1,[r1,r2]
>  
> -#if __ARM_ARCH < 8
> +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
>  	tst r1,#0x260
>  	beq 3f
>  	// HWCAP_ARM_FPA
> @@ -33,7 +34,7 @@ setjmp:
>  	.fpu softvfp
>  	.eabi_attribute 10, 0
>  	.eabi_attribute 27, 0
> -#if __ARM_ARCH < 8
> +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
>  	// HWCAP_ARM_IWMMXT
>  2:	tst r1,#0x200
>  	beq 3f
> @@ -44,6 +45,30 @@ setjmp:
>  	stcl p1, cr14, [ip], #8
>  	stcl p1, cr15, [ip], #8
>  #endif
> +#else
> +	stmia r0!,{r4-r7}
> +	mov r1,r8
> +	mov r2,r9
> +	mov r3,r10
> +	mov r4,r11
> +	mov r5,sp
> +	mov r6,lr
> +	stmia r0!,{r1-r6}
> +	mov ip,r0
> +	movs r0,#0
> +
> +	adr r1,1f
> +	ldr r2,1f
> +	ldr r1,[r1,r2]
> +	movs r2,#0x40
> +	tst r1,r2
> +	beq 2f
> +	.fpu vfp
> +	vstmia ip!, {d8-d15}
> +	.fpu softvfp
> +	.eabi_attribute 10, 0
> +	.eabi_attribute 27, 0
> +#endif
>  2:
>  3:	bx lr

Same here.

> diff --git a/src/signal/arm/restore.s b/src/signal/arm/restore.s
> index fb086d9b..2b7621b1 100644
> --- a/src/signal/arm/restore.s
> +++ b/src/signal/arm/restore.s
> @@ -4,12 +4,12 @@
>  .hidden __restore
>  .type __restore,%function
>  __restore:
> -	mov r7,#119
> +	movs r7,#119
>  	swi 0x0
>  
>  .global __restore_rt
>  .hidden __restore_rt
>  .type __restore_rt,%function
>  __restore_rt:
> -	mov r7,#173
> +	movs r7,#173
>  	swi 0x0

Probably OK, unles gdb does something ridiculous wanting the exact
insn sequence...

> diff --git a/src/signal/arm/sigsetjmp.S b/src/signal/arm/sigsetjmp.S
> index 69ebbf49..85a71666 100644
> --- a/src/signal/arm/sigsetjmp.S
> +++ b/src/signal/arm/sigsetjmp.S
> @@ -9,16 +9,36 @@ __sigsetjmp:
>  	bne 1f
>  	b setjmp
>  
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  1:	str lr,[r0,#256]
>  	str r4,[r0,#260+8]
> +#else
> +1:	mov ip,r2
> +	mov r2,lr
> +	adds r0,#200
> +	str r2,[r0,#56]
> +	str r4,[r0,#60+8]
> +	subs r0,#200
> +	mov r2,ip
> +#endif
>  	mov r4,r0
>  
>  	bl setjmp
>  
>  	mov r1,r0
>  	mov r0,r4
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	ldr lr,[r0,#256]
>  	ldr r4,[r0,#260+8]
> +#else
> +	mov ip,r2
> +	adds r0,#200
> +	ldr r2,[r0,#56]
> +	mov lr,r2
> +	ldr r4,[r0,#60+8]
> +	subs r0,#200
> +	mov r2,ip
> +#endif

Ideally should be unified.

>  .hidden __sigsetjmp_tail
>  	b __sigsetjmp_tail
> diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> index 869e3448..c5d17533 100644
> --- a/src/string/arm/memcpy.S
> +++ b/src/string/arm/memcpy.S
> @@ -43,6 +43,8 @@
>   * building as thumb 2 and big-endian.
>   */
>  
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
> +
>  .syntax unified
>  
>  .global memcpy
> @@ -477,3 +479,4 @@ copy_last_3_and_return:
>  	ldmfd   sp!, {r0, r4, lr}
>  	bx      lr
>  
> +#endif // defined(__thumb2__)  ||  !defined(__thumb__)

The logic to enable the C memcpy seems to be missing.

> diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> index 09de65aa..cb5d757d 100644
> --- a/src/thread/arm/__set_thread_area.c
> +++ b/src/thread/arm/__set_thread_area.c
> @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
>  
>  int __set_thread_area(void *p)
>  {
> -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> +#if !__ARM_ARCH_ISA_ARM
> +	__a_gettp_ptr = __a_gettp_kuser;
> +	__a_cas_ptr = __a_cas_kuser;
> +	__a_barrier_ptr = __a_barrier_kuser;
> +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
>  	if (__hwcap & HWCAP_TLS) {
>  		size_t *aux;
>  		__a_cas_ptr = __a_cas_v7;

HWCAP needs to be honored here. You also need to add the new logic to
use the syscall for getting thread pointer, I think.

> diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> index 29c2d07b..fe0406e3 100644
> --- a/src/thread/arm/__unmapself.s
> +++ b/src/thread/arm/__unmapself.s
> @@ -3,7 +3,7 @@
>  .global __unmapself
>  .type   __unmapself,%function
>  __unmapself:
> -	mov r7,#91
> +	movs r7,#91
>  	svc 0
> -	mov r7,#1
> +	movs r7,#1
>  	svc 0

OK, but this file can't be used on nommu anyway. Instead the generic C
version has to be used.

> diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> index da50508d..d49809d9 100644
> --- a/src/thread/arm/atomics.S
> +++ b/src/thread/arm/atomics.S
> @@ -1,3 +1,5 @@
> +#if __ARM_ARCH_ISA_ARM
> +
>  .syntax unified
>  .text
>  
> @@ -104,3 +106,5 @@ __a_cas_ptr:
>  .hidden __a_gettp_ptr
>  __a_gettp_ptr:
>  	.word __a_gettp_cp15
> +
> +#endif

This file should not be conditional. If it can't be assembled in
default mode it just needs ISA level override directives. Which
functions are actually called is selected at runtime.

> diff --git a/src/thread/arm/clone.S b/src/thread/arm/clone.S
> index bb0965da..70d29f70 100644
> --- a/src/thread/arm/clone.S
> +++ b/src/thread/arm/clone.S
> @@ -4,24 +4,30 @@
>  .hidden __clone
>  .type   __clone,%function
>  __clone:
> -	stmfd sp!,{r4,r5,r6,r7}
> -	mov r7,#120
> +	push {r4,r5,r6,r7}
> +	movs r7,#120
>  	mov r6,r3
>  	mov r5,r0
>  	mov r0,r2
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	and r1,r1,#-16
> +#else
> +	movs r2,#0
> +	subs r2,#16
> +	ands r1,r2
> +#endif
>  	ldr r2,[sp,#16]
>  	ldr r3,[sp,#20]
>  	ldr r4,[sp,#24]
>  	svc 0
>  	tst r0,r0
>  	beq 1f
> -	ldmfd sp!,{r4,r5,r6,r7}
> +	pop {r4,r5,r6,r7}
>  	bx lr
>  
>  1:	mov r0,r6
>  	bl 3f
> -2:	mov r7,#1
> +2:	movs r7,#1
>  	svc 0
>  	b 2b

I think this can all be unconditional, no #if's.

> diff --git a/src/thread/arm/syscall_cp.S b/src/thread/arm/syscall_cp.S
> index e607dd42..f53cbb3a 100644
> --- a/src/thread/arm/syscall_cp.S
> +++ b/src/thread/arm/syscall_cp.S
> @@ -11,7 +11,7 @@
>  .type __syscall_cp_asm,%function
>  __syscall_cp_asm:
>  	mov ip,sp
> -	stmfd sp!,{r4,r5,r6,r7}
> +	push {r4,r5,r6,r7}
>  __cp_begin:
>  	ldr r0,[r0]
>  	cmp r0,#0
> @@ -19,11 +19,16 @@ __cp_begin:
>  	mov r7,r1
>  	mov r0,r2
>  	mov r1,r3
> +#if defined(__thumb2__)  ||  !defined(__thumb__)
>  	ldmfd ip,{r2,r3,r4,r5,r6}
> +#else
> +	mov r2,ip
> +	ldmfd r2,{r2,r3,r4,r5,r6}
> +#endif
>  	svc 0
>  __cp_end:
> -	ldmfd sp!,{r4,r5,r6,r7}
> +	pop {r4,r5,r6,r7}
>  	bx lr
>  __cp_cancel:
> -	ldmfd sp!,{r4,r5,r6,r7}
> +	pop {r4,r5,r6,r7}
>  	b __cancel
> diff --git a/tools/install.sh b/tools/install.sh
> old mode 100755
> new mode 100644

This can be unconditional, and maybe a register other than ip can be
used to make it simpler..?

Rich

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-17  0:23 ` Rich Felker
@ 2020-12-17  4:55   ` Patrick Oppenlander
  2020-12-17  5:10     ` Patrick Oppenlander
  2020-12-18  7:15   ` Jesse DeGuire
  2020-12-18 19:38   ` Markus Wichmann
  2 siblings, 1 reply; 17+ messages in thread
From: Patrick Oppenlander @ 2020-12-17  4:55 UTC (permalink / raw)
  To: musl; +Cc: Jesse DeGuire

On Thu, Dec 17, 2020 at 11:24 AM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > Hey everyone,
> >
> > I'm working on putting together a Clang-based toolchain to use with
> > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > the C library. Currently, I'm trying to get it to build for a few
> > different Cortex-M devices and have found that Musl builds fine for
> > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > instructions not supported on the Thumb ISA subset used by v6-M and
> > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > can easily switch to HEAD if needed. I am using a Python script to
> > build Musl (and eventually the rest of the toolchain), which you can
> > see on GitHub at the following link. It's a bit of a mess at the
> > moment, but the build_musl() function is what I'm currently using to
> > build Musl.
>
> I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> it is, let's have a look.
>
> > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> >
> > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > Baseline and have attached a diff to this email. If you like, I can go
> > into more detail as to why I made the changes I made; however, many
> > changes were merely the result of my attempts to correct errors
> > reported by Clang due to it encountering instruction sequences not
> > supported on ARMv6-M.
>
> Are there places where clang's linker is failing to make substitutions
> that the GNU one can do, that would make this simpler? For example I
> know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> relocation the assembler emits on the insn).
>
> > A number of errors were simply the result of
> > ARMv6-M requiring one to use the "S" variant of an instruction that
> > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > files I had to change from a "lower case s" to a "capital-S" file so
> > that I could use macros to check for either the Thumb1 ISA
> > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > ("!__ARM_ARCH_ISA_ARM").
>
> Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> If not this may need an alternate detection. But I'd like to drop as
> much as possible and just make the code compatible rather than having
> 2 versions of it. I don't think there are any places where the
> performance or size is at all relevant.
>
> > The changes under
> > "src/thread/arm/__set_thread_area.c" are different in that I made
> > those because I don't believe Cortex-M devices could handle what was
> > there (no M-Profile device has Coprocessor 15, for example) and so I
>
> Unless this is an ISA level that can't be executed on a normal (non-M)
> ARM profile, it still needs all the backends that might be needed and
> runtime selection of which to use. This is okay. I believe Linux for
> nommu ARM has a syscall for get_tp, which is rather awful but probably
> needs to be added as a backend. The right way to do this would have
> been with trap-and-emulate (of cp15) I think...

Linux emulates mrc 15 on old -A cores but they decided not to on -M
for some reason. BTW, the syscall is called get_tls.

Is there any option other than supporting the get_tls syscall? Even if
someone puts in the effort to wire up the trap-and-emulate backend,
musl linked executables will still only run on new kernels.

I took the trap-and-emulate approach in Apex RTOS to avoid opening
this can of worms. It's the only missing link for musl on armv7-m.
Everything else works beautifully.

Patrick

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-17  4:55   ` Patrick Oppenlander
@ 2020-12-17  5:10     ` Patrick Oppenlander
  2020-12-18  7:17       ` Jesse DeGuire
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Oppenlander @ 2020-12-17  5:10 UTC (permalink / raw)
  To: musl; +Cc: Jesse DeGuire

On Thu, Dec 17, 2020 at 3:55 PM Patrick Oppenlander
<patrick.oppenlander@gmail.com> wrote:
>
> On Thu, Dec 17, 2020 at 11:24 AM Rich Felker <dalias@libc.org> wrote:
> >
> > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > Hey everyone,
> > >
> > > I'm working on putting together a Clang-based toolchain to use with
> > > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > > the C library. Currently, I'm trying to get it to build for a few
> > > different Cortex-M devices and have found that Musl builds fine for
> > > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > > instructions not supported on the Thumb ISA subset used by v6-M and
> > > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > > can easily switch to HEAD if needed. I am using a Python script to
> > > build Musl (and eventually the rest of the toolchain), which you can
> > > see on GitHub at the following link. It's a bit of a mess at the
> > > moment, but the build_musl() function is what I'm currently using to
> > > build Musl.
> >
> > I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> > it is, let's have a look.
> >
> > > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> > >
> > > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > > Baseline and have attached a diff to this email. If you like, I can go
> > > into more detail as to why I made the changes I made; however, many
> > > changes were merely the result of my attempts to correct errors
> > > reported by Clang due to it encountering instruction sequences not
> > > supported on ARMv6-M.
> >
> > Are there places where clang's linker is failing to make substitutions
> > that the GNU one can do, that would make this simpler? For example I
> > know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> > relocation the assembler emits on the insn).
> >
> > > A number of errors were simply the result of
> > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > files I had to change from a "lower case s" to a "capital-S" file so
> > > that I could use macros to check for either the Thumb1 ISA
> > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > ("!__ARM_ARCH_ISA_ARM").
> >
> > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > If not this may need an alternate detection. But I'd like to drop as
> > much as possible and just make the code compatible rather than having
> > 2 versions of it. I don't think there are any places where the
> > performance or size is at all relevant.
> >
> > > The changes under
> > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > those because I don't believe Cortex-M devices could handle what was
> > > there (no M-Profile device has Coprocessor 15, for example) and so I
> >
> > Unless this is an ISA level that can't be executed on a normal (non-M)
> > ARM profile, it still needs all the backends that might be needed and
> > runtime selection of which to use. This is okay. I believe Linux for
> > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > needs to be added as a backend. The right way to do this would have
> > been with trap-and-emulate (of cp15) I think...
>
> Linux emulates mrc 15 on old -A cores but they decided not to on -M
> for some reason. BTW, the syscall is called get_tls.
>
> Is there any option other than supporting the get_tls syscall? Even if
> someone puts in the effort to wire up the trap-and-emulate backend,
> musl linked executables will still only run on new kernels.
>
> I took the trap-and-emulate approach in Apex RTOS to avoid opening
> this can of worms. It's the only missing link for musl on armv7-m.
> Everything else works beautifully.

Another consideration is qemu-user: Currently it aborts when
encountering an mrc 15 instruction while emulating armv7-m. I guess
that would probably also be solved by supporting the syscall.

Patrick

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-17  0:23 ` Rich Felker
  2020-12-17  4:55   ` Patrick Oppenlander
@ 2020-12-18  7:15   ` Jesse DeGuire
  2020-12-18 17:30     ` Rich Felker
  2020-12-18 19:38   ` Markus Wichmann
  2 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2020-12-18  7:15 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 23878 bytes --]

On Wed, Dec 16, 2020 at 7:23 PM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > Hey everyone,
> >
> > I'm working on putting together a Clang-based toolchain to use with
> > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > the C library. Currently, I'm trying to get it to build for a few
> > different Cortex-M devices and have found that Musl builds fine for
> > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > instructions not supported on the Thumb ISA subset used by v6-M and
> > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > can easily switch to HEAD if needed. I am using a Python script to
> > build Musl (and eventually the rest of the toolchain), which you can
> > see on GitHub at the following link. It's a bit of a mess at the
> > moment, but the build_musl() function is what I'm currently using to
> > build Musl.
>
> I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> it is, let's have a look.
>
> > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> >
> > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > Baseline and have attached a diff to this email. If you like, I can go
> > into more detail as to why I made the changes I made; however, many
> > changes were merely the result of my attempts to correct errors
> > reported by Clang due to it encountering instruction sequences not
> > supported on ARMv6-M.
>
> Are there places where clang's linker is failing to make substitutions
> that the GNU one can do, that would make this simpler? For example I
> know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> relocation the assembler emits on the insn).

The main issue I ran into is that ARMv6-M and v8-M.base have many
instructions that can accept only registers R0-R7 as input, so some
extra care or shuffling is required. Many of these were pretty easily
worked around.

> > A number of errors were simply the result of
> > ARMv6-M requiring one to use the "S" variant of an instruction that
> > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > files I had to change from a "lower case s" to a "capital-S" file so
> > that I could use macros to check for either the Thumb1 ISA
> > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > ("!__ARM_ARCH_ISA_ARM").
>
> Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> If not this may need an alternate detection. But I'd like to drop as
> much as possible and just make the code compatible rather than having
> 2 versions of it. I don't think there are any places where the
> performance or size is at all relevant.

I hadn't thought enough about that until your reply, so I used
Compiler Explorer to at least get a general idea of when that macro
was supported. Clang 3.5 supports it, but Clang 3.4.1 does not. GCC
5.4 supports it, but 4.64 (the previous version available on Compiler
Explorer) does not. With that, I tried to come up with something to
detect M-Profile devices and have updated my changes with it.

   #if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ \
     || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)

This presumes that the __ARM_ARCH_xx__ macros were supported for as
long as that variant was supported and that any compiler that supports
an ARM arch greater than 7 is new enough to have the
__ARM_ARCH_ISA_ARM macro, but this should cover a wider range of
toolchains.

I also did merge some of the alternate paths like you suggested.

> > The changes under
> > "src/thread/arm/__set_thread_area.c" are different in that I made
> > those because I don't believe Cortex-M devices could handle what was
> > there (no M-Profile device has Coprocessor 15, for example) and so I
>
> Unless this is an ISA level that can't be executed on a normal (non-M)
> ARM profile, it still needs all the backends that might be needed and
> runtime selection of which to use. This is okay. I believe Linux for
> nommu ARM has a syscall for get_tp, which is rather awful but probably
> needs to be added as a backend. The right way to do this would have
> been with trap-and-emulate (of cp15) I think...

Out of curiosity, what makes that syscall awful?

> > sort of punted for now and figured that a user's fault handler could
> > handle those "_kuser" calls given that there probably won't be a
> > kernel to do so.
>
> I don't think you can use the kuser_helper versions here because (1)
> they're at ARM addresses, not thumb ones, and (2) they're in the
> address range that Cortex M uses for special stuff.

Ah, right you are! It turns out that the range 0xE0100000 and above is
a "vendor system region", presumably for vendors to add their own
system control or debug features on top of what ARM already provides
and so one cannot assume that the region is empty.

> > Unfortunately, I'm not far enough yet in my project to build code that
> > would actually run on one of the micros I have with me, so I haven't
> > yet been able to properly try out these changes. Also, I should
> > mention that my eventual plan is to make a "baremetal" layer that
> > would provide stubs to handle syscalls or thread-related things that
> > you wouldn't normally have on a microcontroller, but I want to get
> > Musl building without that first because I think I'll be able to
> > utilize a lot of what's already present when I get to that point.
> > Hopefully, what's here is still somewhat useful as a starting point
> > should Musl want to support Cortex-M. I'll try to update this diff as
> > a result of feedback or my own eventual testing (I'm a long way from
> > that, though!).
>
> Thanks for the update. Please try to attach as UTF-8 rather than
> UTF-16 next time so it's more readable/accessible. I had to guess it
> was UTF-16 and switch the mime encoding headers to read it.

Oops, sorry about that. I must have accidentally created the diff with
PowerShell instead of my normal WSL Bash terminal.

> Further comments inline below:
>
> > diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> > index 9e3937cc..21f102ad 100644
> > --- a/arch/arm/atomic_arch.h
> > +++ b/arch/arm/atomic_arch.h
> > @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
> >
> >  #ifndef a_barrier
> >  #define a_barrier a_barrier
> > +#if __ARM_ARCH_ISA_ARM
> >  static inline void a_barrier()
> >  {
> >       register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
> >       __asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
> >  }
> > +#else
> > +static inline void a_barrier()
> > +{
> > +     __asm__ __volatile__ ("dmb" : : : "memory");
> > +}
> > +#endif
> >  #endif
>
> There's a condition above with this code already; you just need to
> make it used for an additional case.

Fixed. The original code was nested inside an ifdef that also would
have used LDREX and STREX--neither of which ARMv6-M supports--so I
moved that out and added the extra check.

> >  #define a_crash a_crash
> > diff --git a/arch/arm/crt_arch.h b/arch/arm/crt_arch.h
> > index 99508b1d..c48b14b0 100644
> > --- a/arch/arm/crt_arch.h
> > +++ b/arch/arm/crt_arch.h
> > @@ -3,13 +3,25 @@ __asm__(
> >  ".global " START " \n"
> >  ".type " START ",%function \n"
> >  START ": \n"
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >  "    mov fp, #0 \n"
> >  "    mov lr, #0 \n"
> > +#else
> > +"    movs a3, #0 \n"
> > +"    mov fp, a3 \n"
> > +"    mov lr, a3 \n"
> > +#endif
> >  "    ldr a2, 1f \n"
> >  "    add a2, pc, a2 \n"
> >  "    mov a1, sp \n"
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >  "2:  and ip, a1, #-16 \n"
> >  "    mov sp, ip \n"
> > +#else
> > +"2:  subs a3, #16 \n"
> > +"    ands a1, a3 \n"
> > +"    mov sp, a1 \n"
> > +#endif
> >  "    bl " START "_c \n"
> >  ".weak _DYNAMIC \n"
> >  ".hidden _DYNAMIC \n"
>
> This can just use the new code unconditionally.

Done.

> > diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> > index e689ea21..a89fb554 100644
> > --- a/arch/arm/pthread_arch.h
> > +++ b/arch/arm/pthread_arch.h
> > @@ -1,5 +1,5 @@
> > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
>
> That's probably not the right condition.

Can you provide more detail as I'm missing what you mean? The intent
was to prevent this check from succeeding for any M-Profile device so
that they fall back to the __a gettp_ptr solution. I did update this
check to what I described at the top of this reply.

> >  static inline pthread_t __pthread_self()
> >  {
> > diff --git a/configure b/configure
> > old mode 100755
> > new mode 100644
> > diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> > index dc020f92..83e20da2 100644
> > --- a/crt/arm/crtn.S
> > +++ b/crt/arm/crtn.S
> > @@ -1,9 +1,17 @@
> >  .syntax unified
> >
> >  .section .init
> > +#if __ARM_ARCH_ISA_ARM
> >       pop {r0,lr}
> >       bx lr
> > +#else
> > +     pop {r0,pc}
> > +#endif
> >
> >  .section .fini
> > +#if __ARM_ARCH_ISA_ARM
> >       pop {r0,lr}
> >       bx lr
> > +#else
> > +     pop {r0,pc}
> > +#endif
>
> Ideally the linker could handle this.

The issue here is that ARMv6-M (and I believe v8-M.base) cannot POP
into LR. Can this sequence replace LR with R1? My current version POPs
into R1 and moves that into LR because I wasn't sure if that was
important for the caller.

> > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > index 3ae133c9..1843d97d 100644
> > --- a/src/ldso/arm/tlsdesc.S
> > +++ b/src/ldso/arm/tlsdesc.S
> > @@ -12,13 +12,19 @@ __tlsdesc_static:
> >  .hidden __tlsdesc_dynamic
> >  .type __tlsdesc_dynamic,%function
> >  __tlsdesc_dynamic:
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       push {r2,r3,ip,lr}
> > +#else
> > +     push {r2-r4,lr}
> > +     mov r2,ip
> > +     push {r2}
> > +#endif
>
> AFAICT ip is not special here. If it's a pain to push/pop, just use a
> different register.

Gotcha. It looks like there isn't really a good reason for the
original to use IP, either, so I could change that and merge a couple
of those alternate code paths. If you don't mind a few extra
instructions, I think I can get rid of all of the (__thumb2__ ||
!__thumb__) checks in there.

> >       ldr r1,[r0]
> >       ldr r2,[r1,#4]  // r2 = offset
> >       ldr r1,[r1]     // r1 = modid
> >
> > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> >       mrc p15,0,r0,c13,c0,3
> >  #else
> >       ldr r0,1f
> > @@ -36,19 +42,34 @@ __tlsdesc_dynamic:
> >       bx r0
> >  #endif
> >  #endif
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       ldr r3,[r0,#-4] // r3 = dtv
> >       ldr ip,[r3,r1,LSL #2]
> >       sub r0,ip,r0
> > +#else
> > +     mov r4,r0
> > +     subs r4,#4
> > +     ldr r3,[r4]
> > +     lsls r4,r1,#2
> > +     ldr r4,[r3,r4]
> > +     subs r0,r4,r0
> > +#endif
> >       add r0,r0,r2    // r0 = r3[r1]-r0+r2
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >  #if __ARM_ARCH >= 5
> >       pop {r2,r3,ip,pc}
> >  #else
> >       pop {r2,r3,ip,lr}
> >       bx lr
> >  #endif
> > +#else
> > +     pop {r2}
> > +     mov ip,r2
> > +     pop {r2-r4,pc}
> > +#endif
> >
> > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> >  #else
> >       .align 2
> >  1:   .word __a_gettp_ptr - 2b
> > diff --git a/src/process/arm/vfork.s b/src/process/arm/vfork.s
> > index d7ec41b3..b6f0260e 100644
> > --- a/src/process/arm/vfork.s
> > +++ b/src/process/arm/vfork.s
> > @@ -3,7 +3,7 @@
> >  .type vfork,%function
> >  vfork:
> >       mov ip, r7
> > -     mov r7, 190
> > +     movs r7, 190
> >       svc 0
> >       mov r7, ip
> >       .hidden __syscall_ret
>
> OK.
>
> > diff --git a/src/setjmp/arm/longjmp.S b/src/setjmp/arm/longjmp.S
> > index 8df0b819..745c9ba2 100644
> > --- a/src/setjmp/arm/longjmp.S
> > +++ b/src/setjmp/arm/longjmp.S
> > @@ -5,6 +5,7 @@
> >  .type longjmp,%function
> >  _longjmp:
> >  longjmp:
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       mov ip,r0
> >       movs r0,r1
> >       moveq r0,#1
> > @@ -16,7 +17,7 @@ longjmp:
> >       ldr r2,1f
> >       ldr r1,[r1,r2]
> >
> > -#if __ARM_ARCH < 8
> > +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
> >       tst r1,#0x260
> >       beq 3f
> >       // HWCAP_ARM_FPA
> > @@ -31,7 +32,7 @@ longjmp:
> >       .fpu softvfp
> >       .eabi_attribute 10, 0
> >       .eabi_attribute 27, 0
> > -#if __ARM_ARCH < 8
> > +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
> >       // HWCAP_ARM_IWMMXT
> >  2:   tst r1,#0x200
> >       beq 3f
> > @@ -42,6 +43,36 @@ longjmp:
> >       ldcl p1, cr14, [ip], #8
> >       ldcl p1, cr15, [ip], #8
> >  #endif
> > +#else
> > +     mov ip,r0
> > +     movs r0,r1
> > +     bne 4f
> > +     movs r0,#1
> > +4:   mov r1,ip
> > +     adds r1,#16
> > +     ldmia r1!, {r2-r7}
> > +     mov lr,r7
> > +     mov sp,r6
> > +     mov r11,r5
> > +     mov r10,r4
> > +     mov r9,r3
> > +     mov r8,r2
> > +     mov ip,r1
> > +     subs r1,#40
> > +     ldmia r1!, {r4-r7}
> > +
> > +     adr r1,1f
> > +     ldr r2,1f
> > +     ldr r1,[r1,r2]
> > +     movs r2,#0x40
> > +     tst r1,r2
> > +     beq 2f
> > +     .fpu vfp
> > +     vldmia ip!, {d8-d15}
> > +     .fpu softvfp
> > +     .eabi_attribute 10, 0
> > +     .eabi_attribute 27, 0
> > +#endif
> >  2:
> >  3:   bx lr
>
> This should probably be unified somehow.

I was able to partially merge this.

> > diff --git a/src/setjmp/arm/setjmp.S b/src/setjmp/arm/setjmp.S
> > index 45731d22..440c166d 100644
> > --- a/src/setjmp/arm/setjmp.S
> > +++ b/src/setjmp/arm/setjmp.S
> > @@ -8,6 +8,7 @@
> >  __setjmp:
> >  _setjmp:
> >  setjmp:
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       mov ip,r0
> >       stmia ip!,{v1,v2,v3,v4,v5,v6,sl,fp}
> >       mov r2,sp
> > @@ -18,7 +19,7 @@ setjmp:
> >       ldr r2,1f
> >       ldr r1,[r1,r2]
> >
> > -#if __ARM_ARCH < 8
> > +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
> >       tst r1,#0x260
> >       beq 3f
> >       // HWCAP_ARM_FPA
> > @@ -33,7 +34,7 @@ setjmp:
> >       .fpu softvfp
> >       .eabi_attribute 10, 0
> >       .eabi_attribute 27, 0
> > -#if __ARM_ARCH < 8
> > +#if __ARM_ARCH_ISA_ARM  &&  __ARM_ARCH < 8
> >       // HWCAP_ARM_IWMMXT
> >  2:   tst r1,#0x200
> >       beq 3f
> > @@ -44,6 +45,30 @@ setjmp:
> >       stcl p1, cr14, [ip], #8
> >       stcl p1, cr15, [ip], #8
> >  #endif
> > +#else
> > +     stmia r0!,{r4-r7}
> > +     mov r1,r8
> > +     mov r2,r9
> > +     mov r3,r10
> > +     mov r4,r11
> > +     mov r5,sp
> > +     mov r6,lr
> > +     stmia r0!,{r1-r6}
> > +     mov ip,r0
> > +     movs r0,#0
> > +
> > +     adr r1,1f
> > +     ldr r2,1f
> > +     ldr r1,[r1,r2]
> > +     movs r2,#0x40
> > +     tst r1,r2
> > +     beq 2f
> > +     .fpu vfp
> > +     vstmia ip!, {d8-d15}
> > +     .fpu softvfp
> > +     .eabi_attribute 10, 0
> > +     .eabi_attribute 27, 0
> > +#endif
> >  2:
> >  3:   bx lr
>
> Same here.

I partially merged this as well.

> > diff --git a/src/signal/arm/restore.s b/src/signal/arm/restore.s
> > index fb086d9b..2b7621b1 100644
> > --- a/src/signal/arm/restore.s
> > +++ b/src/signal/arm/restore.s
> > @@ -4,12 +4,12 @@
> >  .hidden __restore
> >  .type __restore,%function
> >  __restore:
> > -     mov r7,#119
> > +     movs r7,#119
> >       swi 0x0
> >
> >  .global __restore_rt
> >  .hidden __restore_rt
> >  .type __restore_rt,%function
> >  __restore_rt:
> > -     mov r7,#173
> > +     movs r7,#173
> >       swi 0x0
>
> Probably OK, unles gdb does something ridiculous wanting the exact
> insn sequence...
>
> > diff --git a/src/signal/arm/sigsetjmp.S b/src/signal/arm/sigsetjmp.S
> > index 69ebbf49..85a71666 100644
> > --- a/src/signal/arm/sigsetjmp.S
> > +++ b/src/signal/arm/sigsetjmp.S
> > @@ -9,16 +9,36 @@ __sigsetjmp:
> >       bne 1f
> >       b setjmp
> >
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >  1:   str lr,[r0,#256]
> >       str r4,[r0,#260+8]
> > +#else
> > +1:   mov ip,r2
> > +     mov r2,lr
> > +     adds r0,#200
> > +     str r2,[r0,#56]
> > +     str r4,[r0,#60+8]
> > +     subs r0,#200
> > +     mov r2,ip
> > +#endif
> >       mov r4,r0
> >
> >       bl setjmp
> >
> >       mov r1,r0
> >       mov r0,r4
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       ldr lr,[r0,#256]
> >       ldr r4,[r0,#260+8]
> > +#else
> > +     mov ip,r2
> > +     adds r0,#200
> > +     ldr r2,[r0,#56]
> > +     mov lr,r2
> > +     ldr r4,[r0,#60+8]
> > +     subs r0,#200
> > +     mov r2,ip
> > +#endif
>
> Ideally should be unified.

Done.

> >  .hidden __sigsetjmp_tail
> >       b __sigsetjmp_tail
> > diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> > index 869e3448..c5d17533 100644
> > --- a/src/string/arm/memcpy.S
> > +++ b/src/string/arm/memcpy.S
> > @@ -43,6 +43,8 @@
> >   * building as thumb 2 and big-endian.
> >   */
> >
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > +
> >  .syntax unified
> >
> >  .global memcpy
> > @@ -477,3 +479,4 @@ copy_last_3_and_return:
> >       ldmfd   sp!, {r0, r4, lr}
> >       bx      lr
> >
> > +#endif // defined(__thumb2__)  ||  !defined(__thumb__)
>
> The logic to enable the C memcpy seems to be missing.

I wasn't actually sure of the best way to do this. In my current
version, I sort of followed the lead of the stuff in src/math/arm and
made it a C file that will either conditionally have the ARM version
as inline assembly or include the generic C version. Is there a better
way? The diff didn't capture this because of the rename; should I
attach it separately?

> > diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> > index 09de65aa..cb5d757d 100644
> > --- a/src/thread/arm/__set_thread_area.c
> > +++ b/src/thread/arm/__set_thread_area.c
> > @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
> >
> >  int __set_thread_area(void *p)
> >  {
> > -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > +#if !__ARM_ARCH_ISA_ARM
> > +     __a_gettp_ptr = __a_gettp_kuser;
> > +     __a_cas_ptr = __a_cas_kuser;
> > +     __a_barrier_ptr = __a_barrier_kuser;
> > +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> >       if (__hwcap & HWCAP_TLS) {
> >               size_t *aux;
> >               __a_cas_ptr = __a_cas_v7;
>
> HWCAP needs to be honored here. You also need to add the new logic to
> use the syscall for getting thread pointer, I think.

I updated this to use the "get_tls" syscall (0xf0006) for M-Profile
devices. The barrier and cas pointers point to the default dummy
functions.

I'm not entirely sure how I should handle HWCAP. What should I do if
TLS is or isn't set? If it isn't set, the original code uses those
'kuser' addresses that you pointed out aren't usable and will bail out
if the version at the special address isn't correct. Right now, I have
mine set to call a_crash() if TLS is not set in HWCAP because that
seems to be the closest I can come to the current functionality.

> > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > index 29c2d07b..fe0406e3 100644
> > --- a/src/thread/arm/__unmapself.s
> > +++ b/src/thread/arm/__unmapself.s
> > @@ -3,7 +3,7 @@
> >  .global __unmapself
> >  .type   __unmapself,%function
> >  __unmapself:
> > -     mov r7,#91
> > +     movs r7,#91
> >       svc 0
> > -     mov r7,#1
> > +     movs r7,#1
> >       svc 0
>
> OK, but this file can't be used on nommu anyway. Instead the generic C
> version has to be used.

LIke memcpy above (and the stuffi in src/math/arm), I made this a C
file that will either have this code as inline assembly or will
include the generic C version. The diff didn't capture this, should I
attach this separately?

> > diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> > index da50508d..d49809d9 100644
> > --- a/src/thread/arm/atomics.S
> > +++ b/src/thread/arm/atomics.S
> > @@ -1,3 +1,5 @@
> > +#if __ARM_ARCH_ISA_ARM
> > +
> >  .syntax unified
> >  .text
> >
> > @@ -104,3 +106,5 @@ __a_cas_ptr:
> >  .hidden __a_gettp_ptr
> >  __a_gettp_ptr:
> >       .word __a_gettp_cp15
> > +
> > +#endif
>
> This file should not be conditional. If it can't be assembled in
> default mode it just needs ISA level override directives. Which
> functions are actually called is selected at runtime.

I removed the condition so that everything in here is always built and
added directives as needed.

I did run into an issue, though. Is there a way to restore the default
architecture after setting it using ".arch <arch>"? Something like
".set push" and ".set pop" in MIPS assembly? I didn't realize what the
"scope" of those .arch directives were until it occurred to me that
one of the functions shouldn't have built for ARMv6-M. I rearranged
the functions so that the ones without .arch directives come first,
which did take care of the problem for now. Trying ".arch" by itself
gets me an error from Clang.

> > diff --git a/src/thread/arm/clone.S b/src/thread/arm/clone.S
> > index bb0965da..70d29f70 100644
> > --- a/src/thread/arm/clone.S
> > +++ b/src/thread/arm/clone.S
> > @@ -4,24 +4,30 @@
> >  .hidden __clone
> >  .type   __clone,%function
> >  __clone:
> > -     stmfd sp!,{r4,r5,r6,r7}
> > -     mov r7,#120
> > +     push {r4,r5,r6,r7}
> > +     movs r7,#120
> >       mov r6,r3
> >       mov r5,r0
> >       mov r0,r2
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       and r1,r1,#-16
> > +#else
> > +     movs r2,#0
> > +     subs r2,#16
> > +     ands r1,r2
> > +#endif
> >       ldr r2,[sp,#16]
> >       ldr r3,[sp,#20]
> >       ldr r4,[sp,#24]
> >       svc 0
> >       tst r0,r0
> >       beq 1f
> > -     ldmfd sp!,{r4,r5,r6,r7}
> > +     pop {r4,r5,r6,r7}
> >       bx lr
> >
> >  1:   mov r0,r6
> >       bl 3f
> > -2:   mov r7,#1
> > +2:   movs r7,#1
> >       svc 0
> >       b 2b
>
> I think this can all be unconditional, no #if's.

Changed to remove the #ifs.

> > diff --git a/src/thread/arm/syscall_cp.S b/src/thread/arm/syscall_cp.S
> > index e607dd42..f53cbb3a 100644
> > --- a/src/thread/arm/syscall_cp.S
> > +++ b/src/thread/arm/syscall_cp.S
> > @@ -11,7 +11,7 @@
> >  .type __syscall_cp_asm,%function
> >  __syscall_cp_asm:
> >       mov ip,sp
> > -     stmfd sp!,{r4,r5,r6,r7}
> > +     push {r4,r5,r6,r7}
> >  __cp_begin:
> >       ldr r0,[r0]
> >       cmp r0,#0
> > @@ -19,11 +19,16 @@ __cp_begin:
> >       mov r7,r1
> >       mov r0,r2
> >       mov r1,r3
> > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> >       ldmfd ip,{r2,r3,r4,r5,r6}
> > +#else
> > +     mov r2,ip
> > +     ldmfd r2,{r2,r3,r4,r5,r6}
> > +#endif
> >       svc 0
> >  __cp_end:
> > -     ldmfd sp!,{r4,r5,r6,r7}
> > +     pop {r4,r5,r6,r7}
> >       bx lr
> >  __cp_cancel:
> > -     ldmfd sp!,{r4,r5,r6,r7}
> > +     pop {r4,r5,r6,r7}
> >       b __cancel
> > diff --git a/tools/install.sh b/tools/install.sh
> > old mode 100755
> > new mode 100644
>
> This can be unconditional, and maybe a register other than ip can be
> used to make it simpler..?
>
> Rich

Removed the condition.

Thanks for your help!
Jesse

[-- Attachment #2: musl_cortexm_v2.diff --]
[-- Type: application/octet-stream, Size: 11640 bytes --]

diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
index 9e3937cc..fa8a8573 100644
--- a/arch/arm/atomic_arch.h
+++ b/arch/arm/atomic_arch.h
@@ -27,16 +27,6 @@ static inline int a_sc(volatile int *p, int v)
 	return !r;
 }
 
-#if __ARM_ARCH_7A__ || __ARM_ARCH_7R__ ||  __ARM_ARCH >= 7
-
-#define a_barrier a_barrier
-static inline void a_barrier()
-{
-	__asm__ __volatile__ ("dmb ish" : : : "memory");
-}
-
-#endif
-
 #define a_pre_llsc a_barrier
 #define a_post_llsc a_barrier
 
@@ -62,13 +52,22 @@ static inline int a_cas(volatile int *p, int t, int s)
 
 #endif
 
-#ifndef a_barrier
 #define a_barrier a_barrier
+#if __ARM_ARCH_6M__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ ||  __ARM_ARCH >= 7
+
+static inline void a_barrier()
+{
+	__asm__ __volatile__ ("dmb ish" : : : "memory");
+}
+
+#else
+
 static inline void a_barrier()
 {
 	register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
 	__asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
 }
+
 #endif
 
 #define a_crash a_crash
diff --git a/arch/arm/crt_arch.h b/arch/arm/crt_arch.h
index 99508b1d..66080422 100644
--- a/arch/arm/crt_arch.h
+++ b/arch/arm/crt_arch.h
@@ -3,13 +3,15 @@ __asm__(
 ".global " START " \n"
 ".type " START ",%function \n"
 START ": \n"
-"	mov fp, #0 \n"
-"	mov lr, #0 \n"
+"	movs a3, #0 \n"
+"	mov fp, a3 \n"
+"	mov lr, a3 \n"
 "	ldr a2, 1f \n"
 "	add a2, pc, a2 \n"
 "	mov a1, sp \n"
-"2:	and ip, a1, #-16 \n"
-"	mov sp, ip \n"
+"2:	subs a3, #16 \n"
+"	ands a1, a3 \n"
+"	mov sp, a1 \n"
 "	bl " START "_c \n"
 ".weak _DYNAMIC \n"
 ".hidden _DYNAMIC \n"
diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
index e689ea21..ab8f2af2 100644
--- a/arch/arm/pthread_arch.h
+++ b/arch/arm/pthread_arch.h
@@ -1,5 +1,6 @@
-#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+#if !(__ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)) \
+ && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
 
 static inline pthread_t __pthread_self()
 {
diff --git a/crt/arm/crtn.s b/crt/arm/crtn.s
index dc020f92..d751f8cd 100644
--- a/crt/arm/crtn.s
+++ b/crt/arm/crtn.s
@@ -1,9 +1,11 @@
 .syntax unified
 
 .section .init
-	pop {r0,lr}
+	pop {r0,r1}
+	mov lr,r1
 	bx lr
 
 .section .fini
-	pop {r0,lr}
+	pop {r0,r1}
+	mov lr,r1
 	bx lr
diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
index 3ae133c9..f3fcdf43 100644
--- a/src/ldso/arm/tlsdesc.S
+++ b/src/ldso/arm/tlsdesc.S
@@ -12,13 +12,18 @@ __tlsdesc_static:
 .hidden __tlsdesc_dynamic
 .type __tlsdesc_dynamic,%function
 __tlsdesc_dynamic:
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	push {r2,r3,ip,lr}
+#else
+	push {r2-r4,lr}
+#endif
 	ldr r1,[r0]
 	ldr r2,[r1,#4]  // r2 = offset
 	ldr r1,[r1]     // r1 = modid
 
-#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+#if !(__ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)) \
+ && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
 	mrc p15,0,r0,c13,c0,3
 #else
 	ldr r0,1f
@@ -36,19 +41,33 @@ __tlsdesc_dynamic:
 	bx r0
 #endif
 #endif
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	ldr r3,[r0,#-4] // r3 = dtv
 	ldr ip,[r3,r1,LSL #2]
 	sub r0,ip,r0
+#else
+	mov r4,r0
+	subs r4,#4
+	ldr r3,[r4]
+	lsls r4,r1,#2
+	ldr r4,[r3,r4]
+	subs r0,r4,r0
+#endif
 	add r0,r0,r2    // r0 = r3[r1]-r0+r2
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 #if __ARM_ARCH >= 5
 	pop {r2,r3,ip,pc}
 #else
 	pop {r2,r3,ip,lr}
 	bx lr
 #endif
+#else
+	pop {r2-r4,pc}
+#endif
 
-#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+#if !(__ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)) \
+ && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
 #else
 	.align 2
 1:	.word __a_gettp_ptr - 2b
diff --git a/src/process/arm/vfork.s b/src/process/arm/vfork.s
index d7ec41b3..b6f0260e 100644
--- a/src/process/arm/vfork.s
+++ b/src/process/arm/vfork.s
@@ -3,7 +3,7 @@
 .type vfork,%function
 vfork:
 	mov ip, r7
-	mov r7, 190
+	movs r7, 190
 	svc 0
 	mov r7, ip
 	.hidden __syscall_ret
diff --git a/src/setjmp/arm/longjmp.S b/src/setjmp/arm/longjmp.S
index 8df0b819..0241011a 100644
--- a/src/setjmp/arm/longjmp.S
+++ b/src/setjmp/arm/longjmp.S
@@ -5,18 +5,36 @@
 .type longjmp,%function
 _longjmp:
 longjmp:
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	mov ip,r0
 	movs r0,r1
 	moveq r0,#1
 	ldmia ip!, {v1,v2,v3,v4,v5,v6,sl,fp}
 	ldmia ip!, {r2,lr}
 	mov sp,r2
-
+#else
+	mov ip,r0
+	movs r0,r1
+	bne 4f
+	movs r0,#1
+4:	mov r1,ip
+	adds r1,#16
+	ldmia r1!, {r2-r7}
+	mov lr,r7
+	mov sp,r6
+	mov r11,r5
+	mov r10,r4
+	mov r9,r3
+	mov r8,r2
+	mov ip,r1
+	subs r1,#40
+	ldmia r1!, {r4-r7}
+#endif
 	adr r1,1f
 	ldr r2,1f
 	ldr r1,[r1,r2]
 
-#if __ARM_ARCH < 8
+#if !__ARM_ARCH_6M__ && !__ARM_ARCH_7M__ && !__ARM_ARCH_7EM__ && __ARM_ARCH < 8
 	tst r1,#0x260
 	beq 3f
 	// HWCAP_ARM_FPA
@@ -24,14 +42,15 @@ longjmp:
 	beq 2f
 	ldc p2, cr4, [ip], #48
 #endif
-2:	tst r1,#0x40
+2:	movs r2,#0x40
+	tst r1,r2
 	beq 2f
 	.fpu vfp
 	vldmia ip!, {d8-d15}
 	.fpu softvfp
 	.eabi_attribute 10, 0
 	.eabi_attribute 27, 0
-#if __ARM_ARCH < 8
+#if !__ARM_ARCH_6M__ && !__ARM_ARCH_7M__ && !__ARM_ARCH_7EM__ && __ARM_ARCH < 8
 	// HWCAP_ARM_IWMMXT
 2:	tst r1,#0x200
 	beq 3f
diff --git a/src/setjmp/arm/setjmp.S b/src/setjmp/arm/setjmp.S
index 45731d22..98e71068 100644
--- a/src/setjmp/arm/setjmp.S
+++ b/src/setjmp/arm/setjmp.S
@@ -8,17 +8,28 @@
 __setjmp:
 _setjmp:
 setjmp:
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	mov ip,r0
 	stmia ip!,{v1,v2,v3,v4,v5,v6,sl,fp}
 	mov r2,sp
 	stmia ip!,{r2,lr}
-	mov r0,#0
-
+#else
+	stmia r0!,{r4-r7}
+	mov r1,r8
+	mov r2,r9
+	mov r3,r10
+	mov r4,r11
+	mov r5,sp
+	mov r6,lr
+	stmia r0!,{r1-r6}
+	mov ip,r0
+#endif
+	movs r0,#0
 	adr r1,1f
 	ldr r2,1f
 	ldr r1,[r1,r2]
 
-#if __ARM_ARCH < 8
+#if !__ARM_ARCH_6M__ && !__ARM_ARCH_7M__ && !__ARM_ARCH_7EM__ && __ARM_ARCH < 8
 	tst r1,#0x260
 	beq 3f
 	// HWCAP_ARM_FPA
@@ -26,14 +37,15 @@ setjmp:
 	beq 2f
 	stc p2, cr4, [ip], #48
 #endif
-2:	tst r1,#0x40
+2:	movs r2,#0x40
+	tst r1,r2
 	beq 2f
 	.fpu vfp
 	vstmia ip!, {d8-d15}
 	.fpu softvfp
 	.eabi_attribute 10, 0
 	.eabi_attribute 27, 0
-#if __ARM_ARCH < 8
+#if !__ARM_ARCH_6M__ && !__ARM_ARCH_7M__ && !__ARM_ARCH_7EM__ && __ARM_ARCH < 8
 	// HWCAP_ARM_IWMMXT
 2:	tst r1,#0x200
 	beq 3f
diff --git a/src/signal/arm/restore.s b/src/signal/arm/restore.s
index fb086d9b..2b7621b1 100644
--- a/src/signal/arm/restore.s
+++ b/src/signal/arm/restore.s
@@ -4,12 +4,12 @@
 .hidden __restore
 .type __restore,%function
 __restore:
-	mov r7,#119
+	movs r7,#119
 	swi 0x0
 
 .global __restore_rt
 .hidden __restore_rt
 .type __restore_rt,%function
 __restore_rt:
-	mov r7,#173
+	movs r7,#173
 	swi 0x0
diff --git a/src/signal/arm/sigsetjmp.s b/src/signal/arm/sigsetjmp.s
index 69ebbf49..48ebe7cb 100644
--- a/src/signal/arm/sigsetjmp.s
+++ b/src/signal/arm/sigsetjmp.s
@@ -9,16 +9,26 @@ __sigsetjmp:
 	bne 1f
 	b setjmp
 
-1:	str lr,[r0,#256]
-	str r4,[r0,#260+8]
+1:	mov ip,r2
+	mov r2,lr
+	adds r0,#200
+	str r2,[r0,#56]
+	str r4,[r0,#60+8]
+	subs r0,#200
+	mov r2,ip
 	mov r4,r0
 
 	bl setjmp
 
 	mov r1,r0
 	mov r0,r4
-	ldr lr,[r0,#256]
-	ldr r4,[r0,#260+8]
+	mov ip,r2
+	adds r0,#200
+	ldr r2,[r0,#56]
+	mov lr,r2
+	ldr r4,[r0,#60+8]
+	subs r0,#200
+	mov r2,ip
 
 .hidden __sigsetjmp_tail
 	b __sigsetjmp_tail
diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
index 09de65aa..adccd2ce 100644
--- a/src/thread/arm/__set_thread_area.c
+++ b/src/thread/arm/__set_thread_area.c
@@ -8,7 +8,7 @@
 extern hidden const unsigned char
 	__a_barrier_oldkuser[], __a_barrier_v6[], __a_barrier_v7[],
 	__a_cas_v6[], __a_cas_v7[],
-	__a_gettp_cp15[];
+	__a_gettp_cp15[], __a_gettp_syscall[];
 
 #define __a_barrier_kuser 0xffff0fa0
 #define __a_barrier_oldkuser (uintptr_t)__a_barrier_oldkuser
@@ -26,7 +26,13 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
 
 int __set_thread_area(void *p)
 {
-#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
+#if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)
+	if (__hwcap & HWCAP_TLS) {
+		__a_gettp_ptr = __a_gettp_syscall;
+	} else {
+		a_crash();
+	}
+#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
 	if (__hwcap & HWCAP_TLS) {
 		size_t *aux;
 		__a_cas_ptr = __a_cas_v7;
diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
index 29c2d07b..fe0406e3 100644
--- a/src/thread/arm/__unmapself.s
+++ b/src/thread/arm/__unmapself.s
@@ -3,7 +3,7 @@
 .global __unmapself
 .type   __unmapself,%function
 __unmapself:
-	mov r7,#91
+	movs r7,#91
 	svc 0
-	mov r7,#1
+	movs r7,#1
 	svc 0
diff --git a/src/thread/arm/atomics.s b/src/thread/arm/atomics.s
index da50508d..af3b60a2 100644
--- a/src/thread/arm/atomics.s
+++ b/src/thread/arm/atomics.s
@@ -7,10 +7,35 @@
 __a_barrier_dummy:
 	bx lr
 
+.global __a_cas_dummy
+.hidden __a_cas_dummy
+.type __a_cas_dummy,%function
+__a_cas_dummy:
+	mov r3,r0
+	ldr r0,[r2]
+	subs r0,r3,r0
+	bne 1f
+	str r1,[r2]
+1:	bx lr
+
+.global __a_gettp_cp15
+.hidden __a_gettp_cp15
+.type __a_gettp_cp15,%function
+__a_gettp_syscall:
+	push {r7}
+	movs r7,#0xf
+	lsls r7,r7,#16
+	adds r7,#6          /* ARM get_tls syscall (0xf0006) */
+	svc 0
+	pop {r7}
+	bx lr
+
 .global __a_barrier_oldkuser
 .hidden __a_barrier_oldkuser
 .type __a_barrier_oldkuser,%function
 __a_barrier_oldkuser:
+	.arch armv6
+	.arm
 	push {r0,r1,r2,r3,ip,lr}
 	mov r1,r0
 	mov r2,sp
@@ -36,16 +61,6 @@ __a_barrier_v7:
 	dmb ish
 	bx lr
 
-.global __a_cas_dummy
-.hidden __a_cas_dummy
-.type __a_cas_dummy,%function
-__a_cas_dummy:
-	mov r3,r0
-	ldr r0,[r2]
-	subs r0,r3,r0
-	streq r1,[r2]
-	bx lr
-
 .global __a_cas_v6
 .hidden __a_cas_v6
 .type __a_cas_v6,%function
@@ -80,6 +95,8 @@ __a_cas_v7:
 .hidden __a_gettp_cp15
 .type __a_gettp_cp15,%function
 __a_gettp_cp15:
+	.arch armv6
+	.arm
 	mrc p15,0,r0,c13,c0,3
 	bx lr
 
diff --git a/src/thread/arm/clone.s b/src/thread/arm/clone.s
index bb0965da..33c2e59b 100644
--- a/src/thread/arm/clone.s
+++ b/src/thread/arm/clone.s
@@ -4,24 +4,26 @@
 .hidden __clone
 .type   __clone,%function
 __clone:
-	stmfd sp!,{r4,r5,r6,r7}
-	mov r7,#120
+	push {r4,r5,r6,r7}
+	movs r7,#120
 	mov r6,r3
 	mov r5,r0
 	mov r0,r2
-	and r1,r1,#-16
+	movs r2,#0
+	subs r2,#16
+	ands r1,r2
 	ldr r2,[sp,#16]
 	ldr r3,[sp,#20]
 	ldr r4,[sp,#24]
 	svc 0
 	tst r0,r0
 	beq 1f
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	bx lr
 
 1:	mov r0,r6
 	bl 3f
-2:	mov r7,#1
+2:	movs r7,#1
 	svc 0
 	b 2b
 
diff --git a/src/thread/arm/syscall_cp.s b/src/thread/arm/syscall_cp.s
index e607dd42..421e64f4 100644
--- a/src/thread/arm/syscall_cp.s
+++ b/src/thread/arm/syscall_cp.s
@@ -11,7 +11,7 @@
 .type __syscall_cp_asm,%function
 __syscall_cp_asm:
 	mov ip,sp
-	stmfd sp!,{r4,r5,r6,r7}
+	push {r4,r5,r6,r7}
 __cp_begin:
 	ldr r0,[r0]
 	cmp r0,#0
@@ -19,11 +19,12 @@ __cp_begin:
 	mov r7,r1
 	mov r0,r2
 	mov r1,r3
-	ldmfd ip,{r2,r3,r4,r5,r6}
+	mov r2,ip
+	ldmfd r2,{r2,r3,r4,r5,r6}
 	svc 0
 __cp_end:
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	bx lr
 __cp_cancel:
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	b __cancel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-17  5:10     ` Patrick Oppenlander
@ 2020-12-18  7:17       ` Jesse DeGuire
  2020-12-22 21:43         ` Patrick Oppenlander
  0 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2020-12-18  7:17 UTC (permalink / raw)
  To: Patrick Oppenlander; +Cc: musl

On Thu, Dec 17, 2020 at 12:10 AM Patrick Oppenlander
<patrick.oppenlander@gmail.com> wrote:
>
> On Thu, Dec 17, 2020 at 3:55 PM Patrick Oppenlander
> <patrick.oppenlander@gmail.com> wrote:
> >
> > On Thu, Dec 17, 2020 at 11:24 AM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > > Hey everyone,
> > > >
> > > > I'm working on putting together a Clang-based toolchain to use with
> > > > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > > > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > > > the C library. Currently, I'm trying to get it to build for a few
> > > > different Cortex-M devices and have found that Musl builds fine for
> > > > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > > > instructions not supported on the Thumb ISA subset used by v6-M and
> > > > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > > > can easily switch to HEAD if needed. I am using a Python script to
> > > > build Musl (and eventually the rest of the toolchain), which you can
> > > > see on GitHub at the following link. It's a bit of a mess at the
> > > > moment, but the build_musl() function is what I'm currently using to
> > > > build Musl.
> > >
> > > I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> > > it is, let's have a look.
> > >
> > > > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> > > >
> > > > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > > > Baseline and have attached a diff to this email. If you like, I can go
> > > > into more detail as to why I made the changes I made; however, many
> > > > changes were merely the result of my attempts to correct errors
> > > > reported by Clang due to it encountering instruction sequences not
> > > > supported on ARMv6-M.
> > >
> > > Are there places where clang's linker is failing to make substitutions
> > > that the GNU one can do, that would make this simpler? For example I
> > > know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> > > relocation the assembler emits on the insn).
> > >
> > > > A number of errors were simply the result of
> > > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > > files I had to change from a "lower case s" to a "capital-S" file so
> > > > that I could use macros to check for either the Thumb1 ISA
> > > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > > ("!__ARM_ARCH_ISA_ARM").
> > >
> > > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > > If not this may need an alternate detection. But I'd like to drop as
> > > much as possible and just make the code compatible rather than having
> > > 2 versions of it. I don't think there are any places where the
> > > performance or size is at all relevant.
> > >
> > > > The changes under
> > > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > > those because I don't believe Cortex-M devices could handle what was
> > > > there (no M-Profile device has Coprocessor 15, for example) and so I
> > >
> > > Unless this is an ISA level that can't be executed on a normal (non-M)
> > > ARM profile, it still needs all the backends that might be needed and
> > > runtime selection of which to use. This is okay. I believe Linux for
> > > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > > needs to be added as a backend. The right way to do this would have
> > > been with trap-and-emulate (of cp15) I think...
> >
> > Linux emulates mrc 15 on old -A cores but they decided not to on -M
> > for some reason. BTW, the syscall is called get_tls.
> >
> > Is there any option other than supporting the get_tls syscall? Even if
> > someone puts in the effort to wire up the trap-and-emulate backend,
> > musl linked executables will still only run on new kernels.
> >
> > I took the trap-and-emulate approach in Apex RTOS to avoid opening
> > this can of worms. It's the only missing link for musl on armv7-m.
> > Everything else works beautifully.
>
> Another consideration is qemu-user: Currently it aborts when
> encountering an mrc 15 instruction while emulating armv7-m. I guess
> that would probably also be solved by supporting the syscall.
>
> Patrick

ARMv6-M and v8-M.base do not support the MRC instruction at all. Could
that play into why Linux and qemu bail?

Jesse

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-18  7:15   ` Jesse DeGuire
@ 2020-12-18 17:30     ` Rich Felker
  2020-12-21 23:58       ` Jesse DeGuire
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Felker @ 2020-12-18 17:30 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl

On Fri, Dec 18, 2020 at 02:15:20AM -0500, Jesse DeGuire wrote:
> > > A number of errors were simply the result of
> > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > files I had to change from a "lower case s" to a "capital-S" file so
> > > that I could use macros to check for either the Thumb1 ISA
> > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > ("!__ARM_ARCH_ISA_ARM").
> >
> > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > If not this may need an alternate detection. But I'd like to drop as
> > much as possible and just make the code compatible rather than having
> > 2 versions of it. I don't think there are any places where the
> > performance or size is at all relevant.
> 
> I hadn't thought enough about that until your reply, so I used
> Compiler Explorer to at least get a general idea of when that macro
> was supported. Clang 3.5 supports it, but Clang 3.4.1 does not. GCC
> 5.4 supports it, but 4.64 (the previous version available on Compiler
> Explorer) does not. With that, I tried to come up with something to
> detect M-Profile devices and have updated my changes with it.
> 
>    #if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ \
>      || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)

I think you can just use presence of __ARM_ARCH as a proxy for
"checking __ARM_ARCH_ISA_ARM is meaningful" if you can verify that
they were added at the same time. Then you don't need to special-case
each M variant. Or am I missing something?

> > > The changes under
> > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > those because I don't believe Cortex-M devices could handle what was
> > > there (no M-Profile device has Coprocessor 15, for example) and so I
> >
> > Unless this is an ISA level that can't be executed on a normal (non-M)
> > ARM profile, it still needs all the backends that might be needed and
> > runtime selection of which to use. This is okay. I believe Linux for
> > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > needs to be added as a backend. The right way to do this would have
> > been with trap-and-emulate (of cp15) I think...
> 
> Out of curiosity, what makes that syscall awful?

Well it precludes just using the same inline code as long as the ISA
level is >= a minimum; now levels supporting Cortex-M have to use the
runtime selection. I'm not sure which performs better. I would think
they'd be comparable.

> > Further comments inline below:
> >
> > > diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> > > index 9e3937cc..21f102ad 100644
> > > --- a/arch/arm/atomic_arch.h
> > > +++ b/arch/arm/atomic_arch.h
> > > @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
> > >
> > >  #ifndef a_barrier
> > >  #define a_barrier a_barrier
> > > +#if __ARM_ARCH_ISA_ARM
> > >  static inline void a_barrier()
> > >  {
> > >       register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
> > >       __asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
> > >  }
> > > +#else
> > > +static inline void a_barrier()
> > > +{
> > > +     __asm__ __volatile__ ("dmb" : : : "memory");
> > > +}
> > > +#endif
> > >  #endif
> >
> > There's a condition above with this code already; you just need to
> > make it used for an additional case.
> 
> Fixed. The original code was nested inside an ifdef that also would
> have used LDREX and STREX--neither of which ARMv6-M supports--so I
> moved that out and added the extra check.

If it lacks LDREX and STREX how do you implement atomic? I don't see
where you're adding any alternative, so is v6-m support
non-functional? That rather defeats the purpose of doing anything to
support it...

> > > diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> > > index e689ea21..a89fb554 100644
> > > --- a/arch/arm/pthread_arch.h
> > > +++ b/arch/arm/pthread_arch.h
> > > @@ -1,5 +1,5 @@
> > > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> >
> > That's probably not the right condition.
> 
> Can you provide more detail as I'm missing what you mean? The intent
> was to prevent this check from succeeding for any M-Profile device so
> that they fall back to the __a gettp_ptr solution. I did update this
> check to what I described at the top of this reply.

I think it's only the __ARM_ARCH >= 7 branch that needs any further
condition. The rest of the branches are already specific to non-M.

> > >  static inline pthread_t __pthread_self()
> > >  {
> > > diff --git a/configure b/configure
> > > old mode 100755
> > > new mode 100644
> > > diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> > > index dc020f92..83e20da2 100644
> > > --- a/crt/arm/crtn.S
> > > +++ b/crt/arm/crtn.S
> > > @@ -1,9 +1,17 @@
> > >  .syntax unified
> > >
> > >  .section .init
> > > +#if __ARM_ARCH_ISA_ARM
> > >       pop {r0,lr}
> > >       bx lr
> > > +#else
> > > +     pop {r0,pc}
> > > +#endif
> > >
> > >  .section .fini
> > > +#if __ARM_ARCH_ISA_ARM
> > >       pop {r0,lr}
> > >       bx lr
> > > +#else
> > > +     pop {r0,pc}
> > > +#endif
> >
> > Ideally the linker could handle this.
> 
> The issue here is that ARMv6-M (and I believe v8-M.base) cannot POP
> into LR. Can this sequence replace LR with R1? My current version POPs
> into R1 and moves that into LR because I wasn't sure if that was
> important for the caller.

I don't see any reason you couldn't just use a different
call-clobbered register.

> > > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > > index 3ae133c9..1843d97d 100644
> > > --- a/src/ldso/arm/tlsdesc.S
> > > +++ b/src/ldso/arm/tlsdesc.S
> > > @@ -12,13 +12,19 @@ __tlsdesc_static:
> > >  .hidden __tlsdesc_dynamic
> > >  .type __tlsdesc_dynamic,%function
> > >  __tlsdesc_dynamic:
> > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > >       push {r2,r3,ip,lr}
> > > +#else
> > > +     push {r2-r4,lr}
> > > +     mov r2,ip
> > > +     push {r2}
> > > +#endif
> >
> > AFAICT ip is not special here. If it's a pain to push/pop, just use a
> > different register.
> 
> Gotcha. It looks like there isn't really a good reason for the
> original to use IP, either, so I could change that and merge a couple
> of those alternate code paths. If you don't mind a few extra
> instructions, I think I can get rid of all of the (__thumb2__ ||
> !__thumb__) checks in there.

This is an extreme hot path so it should probably be optimized with
preprocessor where it helps, but that doesn't mean it should have
gratuitous PP branches just to pick which register to use when a
common one that works everywhere can be chosen.

> > >  .hidden __sigsetjmp_tail
> > >       b __sigsetjmp_tail
> > > diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> > > index 869e3448..c5d17533 100644
> > > --- a/src/string/arm/memcpy.S
> > > +++ b/src/string/arm/memcpy.S
> > > @@ -43,6 +43,8 @@
> > >   * building as thumb 2 and big-endian.
> > >   */
> > >
> > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > +
> > >  .syntax unified
> > >
> > >  .global memcpy
> > > @@ -477,3 +479,4 @@ copy_last_3_and_return:
> > >       ldmfd   sp!, {r0, r4, lr}
> > >       bx      lr
> > >
> > > +#endif // defined(__thumb2__)  ||  !defined(__thumb__)
> >
> > The logic to enable the C memcpy seems to be missing.
> 
> I wasn't actually sure of the best way to do this. In my current
> version, I sort of followed the lead of the stuff in src/math/arm and
> made it a C file that will either conditionally have the ARM version
> as inline assembly or include the generic C version. Is there a better
> way? The diff didn't capture this because of the rename; should I
> attach it separately?

Don't move giant asm functions into a .c file. You can just add a new
file that's conditionally empty. See how it was before commit
9dce93ac7f7a76978b70581c6f073f671b583347.

> > > diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> > > index 09de65aa..cb5d757d 100644
> > > --- a/src/thread/arm/__set_thread_area.c
> > > +++ b/src/thread/arm/__set_thread_area.c
> > > @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
> > >
> > >  int __set_thread_area(void *p)
> > >  {
> > > -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > +#if !__ARM_ARCH_ISA_ARM
> > > +     __a_gettp_ptr = __a_gettp_kuser;
> > > +     __a_cas_ptr = __a_cas_kuser;
> > > +     __a_barrier_ptr = __a_barrier_kuser;
> > > +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > >       if (__hwcap & HWCAP_TLS) {
> > >               size_t *aux;
> > >               __a_cas_ptr = __a_cas_v7;
> >
> > HWCAP needs to be honored here. You also need to add the new logic to
> > use the syscall for getting thread pointer, I think.
> 
> I updated this to use the "get_tls" syscall (0xf0006) for M-Profile
> devices. The barrier and cas pointers point to the default dummy
> functions.
> 
> I'm not entirely sure how I should handle HWCAP. What should I do if
> TLS is or isn't set? If it isn't set, the original code uses those
> 'kuser' addresses that you pointed out aren't usable and will bail out
> if the version at the special address isn't correct. Right now, I have
> mine set to call a_crash() if TLS is not set in HWCAP because that
> seems to be the closest I can come to the current functionality.

The current HWCAP_TLS handling is for *all* pre-v7 ISA levels. By
convention its meaning is presence of the cp15 thread pointer, but
presence of this also implies atomics can be used directly (and
absence of this means the kernel must be providing kuser_helper, so we
just use it for atomics too).

With M profile support, though, AIUI it's possible that you have the
atomics but not the thread pointer. You should not assume that lack of
HWCAP_TLS implies lack of the atomics; rather you just have to assume
the atomics are present, I think, or use some other means of detection
with fallback to interrupt masking (assuming these devices have no
kernel/user distinction that prevents you from masking interrupts).
HWCAP_TLS should probably be probed just so you don't assume the
syscall exists in case a system omits it and does trap-and-emulate for
the instruction instead.

> > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > > index 29c2d07b..fe0406e3 100644
> > > --- a/src/thread/arm/__unmapself.s
> > > +++ b/src/thread/arm/__unmapself.s
> > > @@ -3,7 +3,7 @@
> > >  .global __unmapself
> > >  .type   __unmapself,%function
> > >  __unmapself:
> > > -     mov r7,#91
> > > +     movs r7,#91
> > >       svc 0
> > > -     mov r7,#1
> > > +     movs r7,#1
> > >       svc 0
> >
> > OK, but this file can't be used on nommu anyway. Instead the generic C
> > version has to be used.
> 
> LIke memcpy above (and the stuffi in src/math/arm), I made this a C
> file that will either have this code as inline assembly or will
> include the generic C version. The diff didn't capture this, should I
> attach this separately?

You can see how it's done in src/thread/sh for nommu support. Although
I'd actually prefer moving the asm to inline in a .c file. If you do
this, use a real C function with inline asm (note: you can't break it
into multiple asm blocks since there's no stack after the first
syscall and thus C code can't run) rather than stuffing the asm
function in a top-level __asm__ statement.

> > > diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> > > index da50508d..d49809d9 100644
> > > --- a/src/thread/arm/atomics.S
> > > +++ b/src/thread/arm/atomics.S
> > > @@ -1,3 +1,5 @@
> > > +#if __ARM_ARCH_ISA_ARM
> > > +
> > >  .syntax unified
> > >  .text
> > >
> > > @@ -104,3 +106,5 @@ __a_cas_ptr:
> > >  .hidden __a_gettp_ptr
> > >  __a_gettp_ptr:
> > >       .word __a_gettp_cp15
> > > +
> > > +#endif
> >
> > This file should not be conditional. If it can't be assembled in
> > default mode it just needs ISA level override directives. Which
> > functions are actually called is selected at runtime.
> 
> I removed the condition so that everything in here is always built and
> added directives as needed.
> 
> I did run into an issue, though. Is there a way to restore the default
> architecture after setting it using ".arch <arch>"? Something like
> ".set push" and ".set pop" in MIPS assembly? I didn't realize what the
> "scope" of those .arch directives were until it occurred to me that
> one of the functions shouldn't have built for ARMv6-M. I rearranged
> the functions so that the ones without .arch directives come first,
> which did take care of the problem for now. Trying ".arch" by itself
> gets me an error from Clang.

Look in setjmp.s at the .fpu and .eabi_attribute usage. I'm not sure
it's entirely the same but should be analogous. Basically you just
want to override whatever ISA-level/ABI tags the assembler tries to
put on the file.

Rich

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-17  0:23 ` Rich Felker
  2020-12-17  4:55   ` Patrick Oppenlander
  2020-12-18  7:15   ` Jesse DeGuire
@ 2020-12-18 19:38   ` Markus Wichmann
  2020-12-18 20:34     ` Rich Felker
  2 siblings, 1 reply; 17+ messages in thread
From: Markus Wichmann @ 2020-12-18 19:38 UTC (permalink / raw)
  To: musl

On Wed, Dec 16, 2020 at 07:23:49PM -0500, Rich Felker wrote:
> On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > index 29c2d07b..fe0406e3 100644
> > --- a/src/thread/arm/__unmapself.s
> > +++ b/src/thread/arm/__unmapself.s
> > @@ -3,7 +3,7 @@
> >  .global __unmapself
> >  .type   __unmapself,%function
> >  __unmapself:
> > -	mov r7,#91
> > +	movs r7,#91
> >  	svc 0
> > -	mov r7,#1
> > +	movs r7,#1
> >  	svc 0
>
> OK, but this file can't be used on nommu anyway. Instead the generic C
> version has to be used.
>

Speaking of that (C) version, maybe someone should note down somewhere
that it is only safe to call if simultaneous calls from other threads
are precluded somehow. That function uses global variables, so if
multiple threads call it at the same time, there will be blood (well,
there will be clobbering in any case). Which is fulfilled at the moment:
The only call site of __unmapself() is pthread_exit(), after getting the
thread list lock. That is a global lock, so further threads also
exitting would have to queue up there. But I think it is rather
non-intuitive that the thread-list lock happens to also protect the
global variables in __unmapself.c. And that is a property of the C
version not shared by the assembler versions, which are all thread-safe.

Also, why can't this be used on nommu? I thought the only differences
between nommu and mmu were that you can't call mmap() except for
MAP_ANON, and you can't call fork(). This merely runs munmap() on a
MAP_ANON region, and exit(), and so does the C version, except the C
version sets up its own stack to do so, in order to not saw off the
branch it is sitting on.  What am I missing?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-18 19:38   ` Markus Wichmann
@ 2020-12-18 20:34     ` Rich Felker
  2020-12-22  0:00       ` Jesse DeGuire
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Felker @ 2020-12-18 20:34 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Fri, Dec 18, 2020 at 08:38:49PM +0100, Markus Wichmann wrote:
> On Wed, Dec 16, 2020 at 07:23:49PM -0500, Rich Felker wrote:
> > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself..s
> > > index 29c2d07b..fe0406e3 100644
> > > --- a/src/thread/arm/__unmapself.s
> > > +++ b/src/thread/arm/__unmapself.s
> > > @@ -3,7 +3,7 @@
> > >  .global __unmapself
> > >  .type   __unmapself,%function
> > >  __unmapself:
> > > -	mov r7,#91
> > > +	movs r7,#91
> > >  	svc 0
> > > -	mov r7,#1
> > > +	movs r7,#1
> > >  	svc 0
> >
> > OK, but this file can't be used on nommu anyway. Instead the generic C
> > version has to be used.
> >
> 
> Speaking of that (C) version, maybe someone should note down somewhere
> that it is only safe to call if simultaneous calls from other threads
> are precluded somehow. That function uses global variables, so if
> multiple threads call it at the same time, there will be blood (well,
> there will be clobbering in any case). Which is fulfilled at the moment:
> The only call site of __unmapself() is pthread_exit(), after getting the
> thread list lock. That is a global lock, so further threads also
> exitting would have to queue up there. But I think it is rather
> non-intuitive that the thread-list lock happens to also protect the
> global variables in __unmapself.c. And that is a property of the C
> version not shared by the assembler versions, which are all thread-safe.

Prior to commit 8f11e6127fe93093f81a52b15bb1537edc3fc8af,
__unmapself.c had its own lock and had to explicitly modify the exit
futex address to unlock that lock. Now it can (and must, since neither
can be unlocked before the kernel-level task exits) share the thread
list lock for this purpose.

FWIW there's arguably no reason to even keep the asm versions of this
function now that the thread list lock exists. We could just remove
them all.

> Also, why can't this be used on nommu? I thought the only differences
> between nommu and mmu were that you can't call mmap() except for
> MAP_ANON, and you can't call fork(). This merely runs munmap() on a
> MAP_ANON region, and exit(), and so does the C version, except the C
> version sets up its own stack to do so, in order to not saw off the
> branch it is sitting on.  What am I missing?

Most nommu cpu archs do not have hardware user/kernel stack switching
on interrupt. Instead the interrupt handler starts on the existing
stack. Linux immediately moves to the proper kernel stack at the entry
point, but one or more slots on the stack have already been written at
this point (either by hardware, or by necessity since there's no work
space to start switching to the kernel stack without spilling some
state).

Rich

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-18 17:30     ` Rich Felker
@ 2020-12-21 23:58       ` Jesse DeGuire
  2020-12-22  1:39         ` Rich Felker
  0 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2020-12-21 23:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Fri, Dec 18, 2020 at 12:30 PM Rich Felker <dalias@libc.org> wrote:
>
> On Fri, Dec 18, 2020 at 02:15:20AM -0500, Jesse DeGuire wrote:
> > > > A number of errors were simply the result of
> > > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > > files I had to change from a "lower case s" to a "capital-S" file so
> > > > that I could use macros to check for either the Thumb1 ISA
> > > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > > ("!__ARM_ARCH_ISA_ARM").
> > >
> > > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > > If not this may need an alternate detection. But I'd like to drop as
> > > much as possible and just make the code compatible rather than having
> > > 2 versions of it. I don't think there are any places where the
> > > performance or size is at all relevant.
> >
> > I hadn't thought enough about that until your reply, so I used
> > Compiler Explorer to at least get a general idea of when that macro
> > was supported. Clang 3.5 supports it, but Clang 3.4.1 does not. GCC
> > 5.4 supports it, but 4.64 (the previous version available on Compiler
> > Explorer) does not. With that, I tried to come up with something to
> > detect M-Profile devices and have updated my changes with it.
> >
> >    #if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ \
> >      || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)
>
> I think you can just use presence of __ARM_ARCH as a proxy for
> "checking __ARM_ARCH_ISA_ARM is meaningful" if you can verify that
> they were added at the same time. Then you don't need to special-case
> each M variant. Or am I missing something?

Nah, I was overly cautious I think. I did search through the commit
histories of GCC and LLVM to find that __ARM_ARCH and
__ARM_ARCH_PROFILE were added at the same time in both toolchains, but
__ARM_ARCH_ISA_ARM appears to have been added later in Clang.
Therefore, doing a check for "__ARM_ARCH_PROFILE == 'M'" should be
sufficient.

> > > > The changes under
> > > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > > those because I don't believe Cortex-M devices could handle what was
> > > > there (no M-Profile device has Coprocessor 15, for example) and so I
> > >
> > > Unless this is an ISA level that can't be executed on a normal (non-M)
> > > ARM profile, it still needs all the backends that might be needed and
> > > runtime selection of which to use. This is okay. I believe Linux for
> > > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > > needs to be added as a backend. The right way to do this would have
> > > been with trap-and-emulate (of cp15) I think...
> >
> > Out of curiosity, what makes that syscall awful?
>
> Well it precludes just using the same inline code as long as the ISA
> level is >= a minimum; now levels supporting Cortex-M have to use the
> runtime selection. I'm not sure which performs better. I would think
> they'd be comparable.

I see, thanks. I've been doing most of my work with MIPS32r2-based
micros for the past few years and from what I've seen so far ARM has
much more variety to deal with, for better or worse.

> > > Further comments inline below:
> > >
> > > > diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> > > > index 9e3937cc..21f102ad 100644
> > > > --- a/arch/arm/atomic_arch.h
> > > > +++ b/arch/arm/atomic_arch.h
> > > > @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
> > > >
> > > >  #ifndef a_barrier
> > > >  #define a_barrier a_barrier
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >  static inline void a_barrier()
> > > >  {
> > > >       register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
> > > >       __asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
> > > >  }
> > > > +#else
> > > > +static inline void a_barrier()
> > > > +{
> > > > +     __asm__ __volatile__ ("dmb" : : : "memory");
> > > > +}
> > > > +#endif
> > > >  #endif
> > >
> > > There's a condition above with this code already; you just need to
> > > make it used for an additional case.
> >
> > Fixed. The original code was nested inside an ifdef that also would
> > have used LDREX and STREX--neither of which ARMv6-M supports--so I
> > moved that out and added the extra check.
>
> If it lacks LDREX and STREX how do you implement atomic? I don't see
> where you're adding any alternative, so is v6-m support
> non-functional? That rather defeats the purpose of doing anything to
> support it...

Correct, I haven't yet added an alternative. Arm's answer--and what we
generally do in the embedded world--is to disable interrupts using
"cpsid", do your thing, then re-enable interrupts with "cpsie". This
could be done with a new "__a_cas_v6m" variant that I'd add to
atomics.s. This still won't work for Linux because the "cps(ie|id)"
instruction is effectively a no-op if it is executed in an
unprivileged context (meaning you can't trap and emulate it). You'd be
looking at another system call if you really wanted v6-m Linux. That
said, this could let Musl work on v6-m in a bare metal or RTOS
environment, which I think Musl would be great for, and so I'd still
work on adding support for it. Also, not all v6-m devices support a
privilege model and run as though everything is privileged.
ARMv8-M.base is similar to v6-m with LDREX and STREX and so that could
have full support.

> > > > diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> > > > index e689ea21..a89fb554 100644
> > > > --- a/arch/arm/pthread_arch.h
> > > > +++ b/arch/arm/pthread_arch.h
> > > > @@ -1,5 +1,5 @@
> > > > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > > > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> > >
> > > That's probably not the right condition.
> >
> > Can you provide more detail as I'm missing what you mean? The intent
> > was to prevent this check from succeeding for any M-Profile device so
> > that they fall back to the __a gettp_ptr solution. I did update this
> > check to what I described at the top of this reply.
>
> I think it's only the __ARM_ARCH >= 7 branch that needs any further
> condition. The rest of the branches are already specific to non-M.

I see, thanks.

> > > >  static inline pthread_t __pthread_self()
> > > >  {
> > > > diff --git a/configure b/configure
> > > > old mode 100755
> > > > new mode 100644
> > > > diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> > > > index dc020f92..83e20da2 100644
> > > > --- a/crt/arm/crtn.S
> > > > +++ b/crt/arm/crtn.S
> > > > @@ -1,9 +1,17 @@
> > > >  .syntax unified
> > > >
> > > >  .section .init
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >       pop {r0,lr}
> > > >       bx lr
> > > > +#else
> > > > +     pop {r0,pc}
> > > > +#endif
> > > >
> > > >  .section .fini
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >       pop {r0,lr}
> > > >       bx lr
> > > > +#else
> > > > +     pop {r0,pc}
> > > > +#endif
> > >
> > > Ideally the linker could handle this.
> >
> > The issue here is that ARMv6-M (and I believe v8-M.base) cannot POP
> > into LR. Can this sequence replace LR with R1? My current version POPs
> > into R1 and moves that into LR because I wasn't sure if that was
> > important for the caller.
>
> I don't see any reason you couldn't just use a different
> call-clobbered register.

That's what I figured, but I wanted to make sure since I'm still
learning how all these pieces fit together. R1 it is, then.

> > > > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > > > index 3ae133c9..1843d97d 100644
> > > > --- a/src/ldso/arm/tlsdesc.S
> > > > +++ b/src/ldso/arm/tlsdesc.S
> > > > @@ -12,13 +12,19 @@ __tlsdesc_static:
> > > >  .hidden __tlsdesc_dynamic
> > > >  .type __tlsdesc_dynamic,%function
> > > >  __tlsdesc_dynamic:
> > > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > >       push {r2,r3,ip,lr}
> > > > +#else
> > > > +     push {r2-r4,lr}
> > > > +     mov r2,ip
> > > > +     push {r2}
> > > > +#endif
> > >
> > > AFAICT ip is not special here. If it's a pain to push/pop, just use a
> > > different register.
> >
> > Gotcha. It looks like there isn't really a good reason for the
> > original to use IP, either, so I could change that and merge a couple
> > of those alternate code paths. If you don't mind a few extra
> > instructions, I think I can get rid of all of the (__thumb2__ ||
> > !__thumb__) checks in there.
>
> This is an extreme hot path so it should probably be optimized with
> preprocessor where it helps, but that doesn't mean it should have
> gratuitous PP branches just to pick which register to use when a
> common one that works everywhere can be chosen.

Perhaps using ip in the original code is to save pushing another
register to memory so I won't mess with it since this is a hot path.

Does this function need to push r2 and r3 and pop them at the end? My
understanding is that the Arm procedure call standard does not require
subroutines to preserve r0-r3 unless their contents are needed across
a function call that may clobber them. The r3 register isn't used here
until after the call to __a_gettp_ptr. The r2 register is used across
the call, but code does not make an attempt to preserve it before the
call or restore it after. __a_getttp_ptr shouldn't mess with it
anyway. The default version just returns a CP15 register and the
Cortex-M version will trigger a syscall, which is treated like other
interrupts and r0-r3+ip are automatically stacked and unstacked on
Cortex-M interrupts.

> > > >  .hidden __sigsetjmp_tail
> > > >       b __sigsetjmp_tail
> > > > diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> > > > index 869e3448..c5d17533 100644
> > > > --- a/src/string/arm/memcpy.S
> > > > +++ b/src/string/arm/memcpy.S
> > > > @@ -43,6 +43,8 @@
> > > >   * building as thumb 2 and big-endian.
> > > >   */
> > > >
> > > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > > +
> > > >  .syntax unified
> > > >
> > > >  .global memcpy
> > > > @@ -477,3 +479,4 @@ copy_last_3_and_return:
> > > >       ldmfd   sp!, {r0, r4, lr}
> > > >       bx      lr
> > > >
> > > > +#endif // defined(__thumb2__)  ||  !defined(__thumb__)
> > >
> > > The logic to enable the C memcpy seems to be missing.
> >
> > I wasn't actually sure of the best way to do this. In my current
> > version, I sort of followed the lead of the stuff in src/math/arm and
> > made it a C file that will either conditionally have the ARM version
> > as inline assembly or include the generic C version. Is there a better
> > way? The diff didn't capture this because of the rename; should I
> > attach it separately?
>
> Don't move giant asm functions into a .c file. You can just add a new
> file that's conditionally empty. See how it was before commit
> 9dce93ac7f7a76978b70581c6f073f671b583347.

Will do. I'm not sure why I didn't think of that.

> > > > diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> > > > index 09de65aa..cb5d757d 100644
> > > > --- a/src/thread/arm/__set_thread_area.c
> > > > +++ b/src/thread/arm/__set_thread_area.c
> > > > @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
> > > >
> > > >  int __set_thread_area(void *p)
> > > >  {
> > > > -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > > +#if !__ARM_ARCH_ISA_ARM
> > > > +     __a_gettp_ptr = __a_gettp_kuser;
> > > > +     __a_cas_ptr = __a_cas_kuser;
> > > > +     __a_barrier_ptr = __a_barrier_kuser;
> > > > +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > >       if (__hwcap & HWCAP_TLS) {
> > > >               size_t *aux;
> > > >               __a_cas_ptr = __a_cas_v7;
> > >
> > > HWCAP needs to be honored here. You also need to add the new logic to
> > > use the syscall for getting thread pointer, I think.
> >
> > I updated this to use the "get_tls" syscall (0xf0006) for M-Profile
> > devices. The barrier and cas pointers point to the default dummy
> > functions.
> >
> > I'm not entirely sure how I should handle HWCAP. What should I do if
> > TLS is or isn't set? If it isn't set, the original code uses those
> > 'kuser' addresses that you pointed out aren't usable and will bail out
> > if the version at the special address isn't correct. Right now, I have
> > mine set to call a_crash() if TLS is not set in HWCAP because that
> > seems to be the closest I can come to the current functionality.
>
> The current HWCAP_TLS handling is for *all* pre-v7 ISA levels. By
> convention its meaning is presence of the cp15 thread pointer, but
> presence of this also implies atomics can be used directly (and
> absence of this means the kernel must be providing kuser_helper, so we
> just use it for atomics too).
>
> With M profile support, though, AIUI it's possible that you have the
> atomics but not the thread pointer. You should not assume that lack of
> HWCAP_TLS implies lack of the atomics; rather you just have to assume
> the atomics are present, I think, or use some other means of detection
> with fallback to interrupt masking (assuming these devices have no
> kernel/user distinction that prevents you from masking interrupts).
> HWCAP_TLS should probably be probed just so you don't assume the
> syscall exists in case a system omits it and does trap-and-emulate for
> the instruction instead.

I think I'm starting to understand this, which is good because it's
looking like my startup code for the micros will need to properly set
HWCAP before Musl can be used. I assume I'll need to set that
'aux{"AT_PLATFORM"}' to "v6" or "v7" as well to make this runtime
detection work properly. I'll have to figure out if "v6m" and "v7m"
are supported values for the platform. I may have more questions in
the future as I try actually implementing something.

Like I mentioned above, I can make a routine to temporarily mask
interrupts that could be used for v6-m or as a fallback/default for
other M-Profile devices; however, this only works properly if the
application is running in a privileged mode like one might do on bare
metal or an RTOS (not all v6-m devices have an MPU and so not all have
a concept of privilege anyway). The original code looks like it needs
to handle the case of one running a Musl built for ARMv6 on something
newer like ARMv7-A (I assume that's why there's "v7" versions of
functions in atomics.s). I'll try to follow that as well.

> > > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > > > index 29c2d07b..fe0406e3 100644
> > > > --- a/src/thread/arm/__unmapself.s
> > > > +++ b/src/thread/arm/__unmapself.s
> > > > @@ -3,7 +3,7 @@
> > > >  .global __unmapself
> > > >  .type   __unmapself,%function
> > > >  __unmapself:
> > > > -     mov r7,#91
> > > > +     movs r7,#91
> > > >       svc 0
> > > > -     mov r7,#1
> > > > +     movs r7,#1
> > > >       svc 0
> > >
> > > OK, but this file can't be used on nommu anyway. Instead the generic C
> > > version has to be used.
> >
> > LIke memcpy above (and the stuffi in src/math/arm), I made this a C
> > file that will either have this code as inline assembly or will
> > include the generic C version. The diff didn't capture this, should I
> > attach this separately?
>
> You can see how it's done in src/thread/sh for nommu support. Although
> I'd actually prefer moving the asm to inline in a .c file. If you do
> this, use a real C function with inline asm (note: you can't break it
> into multiple asm blocks since there's no stack after the first
> syscall and thus C code can't run) rather than stuffing the asm
> function in a top-level __asm__ statement.

Gotcha.

> > > > diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> > > > index da50508d..d49809d9 100644
> > > > --- a/src/thread/arm/atomics.S
> > > > +++ b/src/thread/arm/atomics.S
> > > > @@ -1,3 +1,5 @@
> > > > +#if __ARM_ARCH_ISA_ARM
> > > > +
> > > >  .syntax unified
> > > >  .text
> > > >
> > > > @@ -104,3 +106,5 @@ __a_cas_ptr:
> > > >  .hidden __a_gettp_ptr
> > > >  __a_gettp_ptr:
> > > >       .word __a_gettp_cp15
> > > > +
> > > > +#endif
> > >
> > > This file should not be conditional. If it can't be assembled in
> > > default mode it just needs ISA level override directives. Which
> > > functions are actually called is selected at runtime.
> >
> > I removed the condition so that everything in here is always built and
> > added directives as needed.
> >
> > I did run into an issue, though. Is there a way to restore the default
> > architecture after setting it using ".arch <arch>"? Something like
> > ".set push" and ".set pop" in MIPS assembly? I didn't realize what the
> > "scope" of those .arch directives were until it occurred to me that
> > one of the functions shouldn't have built for ARMv6-M. I rearranged
> > the functions so that the ones without .arch directives come first,
> > which did take care of the problem for now. Trying ".arch" by itself
> > gets me an error from Clang.
>
> Look in setjmp.s at the .fpu and .eabi_attribute usage. I'm not sure
> it's entirely the same but should be analogous. Basically you just
> want to override whatever ISA-level/ABI tags the assembler tries to
> put on the file.
>
> Rich

I get it now, thanks. I'll probably have a third attempt at a patch
after the holidays.

-Jesse

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-18 20:34     ` Rich Felker
@ 2020-12-22  0:00       ` Jesse DeGuire
  2020-12-22  1:40         ` Rich Felker
  0 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2020-12-22  0:00 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann

On Fri, Dec 18, 2020 at 3:35 PM Rich Felker <dalias@libc.org> wrote:
>
> On Fri, Dec 18, 2020 at 08:38:49PM +0100, Markus Wichmann wrote:
> > On Wed, Dec 16, 2020 at 07:23:49PM -0500, Rich Felker wrote:
> > > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself..s
> > > > index 29c2d07b..fe0406e3 100644
> > > > --- a/src/thread/arm/__unmapself.s
> > > > +++ b/src/thread/arm/__unmapself.s
> > > > @@ -3,7 +3,7 @@
> > > >  .global __unmapself
> > > >  .type   __unmapself,%function
> > > >  __unmapself:
> > > > - mov r7,#91
> > > > + movs r7,#91
> > > >   svc 0
> > > > - mov r7,#1
> > > > + movs r7,#1
> > > >   svc 0
> > >
> > > OK, but this file can't be used on nommu anyway. Instead the generic C
> > > version has to be used.
> > >
> >
> > Speaking of that (C) version, maybe someone should note down somewhere
> > that it is only safe to call if simultaneous calls from other threads
> > are precluded somehow. That function uses global variables, so if
> > multiple threads call it at the same time, there will be blood (well,
> > there will be clobbering in any case). Which is fulfilled at the moment:
> > The only call site of __unmapself() is pthread_exit(), after getting the
> > thread list lock. That is a global lock, so further threads also
> > exitting would have to queue up there. But I think it is rather
> > non-intuitive that the thread-list lock happens to also protect the
> > global variables in __unmapself.c. And that is a property of the C
> > version not shared by the assembler versions, which are all thread-safe.
>
> Prior to commit 8f11e6127fe93093f81a52b15bb1537edc3fc8af,
> __unmapself.c had its own lock and had to explicitly modify the exit
> futex address to unlock that lock. Now it can (and must, since neither
> can be unlocked before the kernel-level task exits) share the thread
> list lock for this purpose.
>
> FWIW there's arguably no reason to even keep the asm versions of this
> function now that the thread list lock exists. We could just remove
> them all.

Should I just go ahead and remove this so the generic C version is
used instead since I'm already here?

-Jesse

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-21 23:58       ` Jesse DeGuire
@ 2020-12-22  1:39         ` Rich Felker
  0 siblings, 0 replies; 17+ messages in thread
From: Rich Felker @ 2020-12-22  1:39 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl

On Mon, Dec 21, 2020 at 06:58:47PM -0500, Jesse DeGuire wrote:
> On Fri, Dec 18, 2020 at 12:30 PM Rich Felker <dalias@libc.org> wrote:
> > If it lacks LDREX and STREX how do you implement atomic? I don't see
> > where you're adding any alternative, so is v6-m support
> > non-functional? That rather defeats the purpose of doing anything to
> > support it...
> 
> Correct, I haven't yet added an alternative. Arm's answer--and what we
> generally do in the embedded world--is to disable interrupts using
> "cpsid", do your thing, then re-enable interrupts with "cpsie". This
> could be done with a new "__a_cas_v6m" variant that I'd add to
> atomics.s. This still won't work for Linux because the "cps(ie|id)"
> instruction is effectively a no-op if it is executed in an
> unprivileged context (meaning you can't trap and emulate it). You'd be
> looking at another system call if you really wanted v6-m Linux. That
> said, this could let Musl work on v6-m in a bare metal or RTOS
> environment, which I think Musl would be great for, and so I'd still
> work on adding support for it. Also, not all v6-m devices support a
> privilege model and run as though everything is privileged.
> ARMv8-M.base is similar to v6-m with LDREX and STREX and so that could
> have full support.

I'm not sure what the right answer for this is and whether it makes
support suitable for upstream or not at this point. We should probably
investigate further. If LDREX/STREX are trappable we could just use
them and require the kernel to trap and emulate but that's kinda ugly
(llsc-type atomics are much harder to emulate correctly than a cas
primitive).

> > > > > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > > > > index 3ae133c9..1843d97d 100644
> > > > > --- a/src/ldso/arm/tlsdesc.S
> > > > > +++ b/src/ldso/arm/tlsdesc.S
> > > > > @@ -12,13 +12,19 @@ __tlsdesc_static:
> > > > >  .hidden __tlsdesc_dynamic
> > > > >  .type __tlsdesc_dynamic,%function
> > > > >  __tlsdesc_dynamic:
> > > > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > > >       push {r2,r3,ip,lr}
> > > > > +#else
> > > > > +     push {r2-r4,lr}
> > > > > +     mov r2,ip
> > > > > +     push {r2}
> > > > > +#endif
> > > >
> > > > AFAICT ip is not special here. If it's a pain to push/pop, just use a
> > > > different register.
> > >
> > > Gotcha. It looks like there isn't really a good reason for the
> > > original to use IP, either, so I could change that and merge a couple
> > > of those alternate code paths. If you don't mind a few extra
> > > instructions, I think I can get rid of all of the (__thumb2__ ||
> > > !__thumb__) checks in there.
> >
> > This is an extreme hot path so it should probably be optimized with
> > preprocessor where it helps, but that doesn't mean it should have
> > gratuitous PP branches just to pick which register to use when a
> > common one that works everywhere can be chosen.
> 
> Perhaps using ip in the original code is to save pushing another
> register to memory so I won't mess with it since this is a hot path.
> 
> Does this function need to push r2 and r3 and pop them at the end? My
> understanding is that the Arm procedure call standard does not require
> subroutines to preserve r0-r3 unless their contents are needed across
> a function call that may clobber them. The r3 register isn't used here
> until after the call to __a_gettp_ptr. The r2 register is used across
> the call, but code does not make an attempt to preserve it before the
> call or restore it after. __a_getttp_ptr shouldn't mess with it
> anyway. The default version just returns a CP15 register and the
> Cortex-M version will trigger a syscall, which is treated like other
> interrupts and r0-r3+ip are automatically stacked and unstacked on
> Cortex-M interrupts.

Both the TLSDESC resolver function and __aeabi_read_tp (which
__a_gettp_ptr needs to be able to act as a backend for) have special
ABI requirements severely restricting what they can clobber. I don't
recall the exact details right off but it's pretty much "nothing"
except the return value.

> > With M profile support, though, AIUI it's possible that you have the
> > atomics but not the thread pointer. You should not assume that lack of
> > HWCAP_TLS implies lack of the atomics; rather you just have to assume
> > the atomics are present, I think, or use some other means of detection
> > with fallback to interrupt masking (assuming these devices have no
> > kernel/user distinction that prevents you from masking interrupts).
> > HWCAP_TLS should probably be probed just so you don't assume the
> > syscall exists in case a system omits it and does trap-and-emulate for
> > the instruction instead.
> 
> I think I'm starting to understand this, which is good because it's
> looking like my startup code for the micros will need to properly set
> HWCAP before Musl can be used. I assume I'll need to set that
> 'aux{"AT_PLATFORM"}' to "v6" or "v7" as well to make this runtime
> detection work properly. I'll have to figure out if "v6m" and "v7m"
> are supported values for the platform. I may have more questions in
> the future as I try actually implementing something.

Yes that sounds right. There are other aux vector entries that have to
be set correctly too for startup code, particularly AT_PHDR for
__init_tls to find the program headers (and for dl_iterate_phdr to
work). On some archs AT_HWCAP and AT_PLATFORM are also needed for
detection of features. AT_MINSIGSTKSZ is needed if the signal frame
size is variable and may exceed the default one defined in the macro.
AT_RANDOM is desirable for hardening but not mandatory. AT_EXECFN is
used as a fallback for program_invocation_name if auxv[0] is missing.
AT_SYSINFO_EHDR is used to offer vdso but is optional. And AT_*ID and
AT_SECURE are used to control behavior under suid (not trust
environment, etc.).

> Like I mentioned above, I can make a routine to temporarily mask
> interrupts that could be used for v6-m or as a fallback/default for
> other M-Profile devices; however, this only works properly if the
> application is running in a privileged mode like one might do on bare
> metal or an RTOS (not all v6-m devices have an MPU and so not all have
> a concept of privilege anyway). The original code looks like it needs
> to handle the case of one running a Musl built for ARMv6 on something
> newer like ARMv7-A (I assume that's why there's "v7" versions of
> functions in atomics.s). I'll try to follow that as well.

We should try to figure out if there's a reliable way to do this whose
failure mode isn't silently doing the wrong thing.

Rich

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-22  0:00       ` Jesse DeGuire
@ 2020-12-22  1:40         ` Rich Felker
  0 siblings, 0 replies; 17+ messages in thread
From: Rich Felker @ 2020-12-22  1:40 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl, Markus Wichmann

On Mon, Dec 21, 2020 at 07:00:02PM -0500, Jesse DeGuire wrote:
> On Fri, Dec 18, 2020 at 3:35 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Fri, Dec 18, 2020 at 08:38:49PM +0100, Markus Wichmann wrote:
> > > On Wed, Dec 16, 2020 at 07:23:49PM -0500, Rich Felker wrote:
> > > > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself..s
> > > > > index 29c2d07b..fe0406e3 100644
> > > > > --- a/src/thread/arm/__unmapself.s
> > > > > +++ b/src/thread/arm/__unmapself.s
> > > > > @@ -3,7 +3,7 @@
> > > > >  .global __unmapself
> > > > >  .type   __unmapself,%function
> > > > >  __unmapself:
> > > > > - mov r7,#91
> > > > > + movs r7,#91
> > > > >   svc 0
> > > > > - mov r7,#1
> > > > > + movs r7,#1
> > > > >   svc 0
> > > >
> > > > OK, but this file can't be used on nommu anyway. Instead the generic C
> > > > version has to be used.
> > > >
> > >
> > > Speaking of that (C) version, maybe someone should note down somewhere
> > > that it is only safe to call if simultaneous calls from other threads
> > > are precluded somehow. That function uses global variables, so if
> > > multiple threads call it at the same time, there will be blood (well,
> > > there will be clobbering in any case). Which is fulfilled at the moment:
> > > The only call site of __unmapself() is pthread_exit(), after getting the
> > > thread list lock. That is a global lock, so further threads also
> > > exitting would have to queue up there. But I think it is rather
> > > non-intuitive that the thread-list lock happens to also protect the
> > > global variables in __unmapself.c. And that is a property of the C
> > > version not shared by the assembler versions, which are all thread-safe.
> >
> > Prior to commit 8f11e6127fe93093f81a52b15bb1537edc3fc8af,
> > __unmapself.c had its own lock and had to explicitly modify the exit
> > futex address to unlock that lock. Now it can (and must, since neither
> > can be unlocked before the kernel-level task exits) share the thread
> > list lock for this purpose.
> >
> > FWIW there's arguably no reason to even keep the asm versions of this
> > function now that the thread list lock exists. We could just remove
> > them all.
> 
> Should I just go ahead and remove this so the generic C version is
> used instead since I'm already here?

Keep what you've got for now. If we decide to remove it globally it's
trivial to merge that change, but the removal is logically unrelated
to the M-profile work and will be done as a separate commit if it's
done.

Rich

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-18  7:17       ` Jesse DeGuire
@ 2020-12-22 21:43         ` Patrick Oppenlander
  2021-01-06  3:24           ` Jesse DeGuire
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Oppenlander @ 2020-12-22 21:43 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl

On Fri, Dec 18, 2020 at 6:17 PM Jesse DeGuire <jesse.a.deguire@gmail.com> wrote:
>
> On Thu, Dec 17, 2020 at 12:10 AM Patrick Oppenlander
> <patrick.oppenlander@gmail.com> wrote:
> >
> > On Thu, Dec 17, 2020 at 3:55 PM Patrick Oppenlander
> > <patrick.oppenlander@gmail.com> wrote:
> > >
> > > On Thu, Dec 17, 2020 at 11:24 AM Rich Felker <dalias@libc.org> wrote:
> > > >
> > > > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > > > Hey everyone,
> > > > >
> > > > > I'm working on putting together a Clang-based toolchain to use with
> > > > > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > > > > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > > > > the C library. Currently, I'm trying to get it to build for a few
> > > > > different Cortex-M devices and have found that Musl builds fine for
> > > > > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > > > > instructions not supported on the Thumb ISA subset used by v6-M and
> > > > > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > > > > can easily switch to HEAD if needed. I am using a Python script to
> > > > > build Musl (and eventually the rest of the toolchain), which you can
> > > > > see on GitHub at the following link. It's a bit of a mess at the
> > > > > moment, but the build_musl() function is what I'm currently using to
> > > > > build Musl.
> > > >
> > > > I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> > > > it is, let's have a look.
> > > >
> > > > > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> > > > >
> > > > > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > > > > Baseline and have attached a diff to this email. If you like, I can go
> > > > > into more detail as to why I made the changes I made; however, many
> > > > > changes were merely the result of my attempts to correct errors
> > > > > reported by Clang due to it encountering instruction sequences not
> > > > > supported on ARMv6-M.
> > > >
> > > > Are there places where clang's linker is failing to make substitutions
> > > > that the GNU one can do, that would make this simpler? For example I
> > > > know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> > > > relocation the assembler emits on the insn).
> > > >
> > > > > A number of errors were simply the result of
> > > > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > > > files I had to change from a "lower case s" to a "capital-S" file so
> > > > > that I could use macros to check for either the Thumb1 ISA
> > > > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > > > ("!__ARM_ARCH_ISA_ARM").
> > > >
> > > > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > > > If not this may need an alternate detection. But I'd like to drop as
> > > > much as possible and just make the code compatible rather than having
> > > > 2 versions of it. I don't think there are any places where the
> > > > performance or size is at all relevant.
> > > >
> > > > > The changes under
> > > > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > > > those because I don't believe Cortex-M devices could handle what was
> > > > > there (no M-Profile device has Coprocessor 15, for example) and so I
> > > >
> > > > Unless this is an ISA level that can't be executed on a normal (non-M)
> > > > ARM profile, it still needs all the backends that might be needed and
> > > > runtime selection of which to use. This is okay. I believe Linux for
> > > > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > > > needs to be added as a backend. The right way to do this would have
> > > > been with trap-and-emulate (of cp15) I think...
> > >
> > > Linux emulates mrc 15 on old -A cores but they decided not to on -M
> > > for some reason. BTW, the syscall is called get_tls.
> > >
> > > Is there any option other than supporting the get_tls syscall? Even if
> > > someone puts in the effort to wire up the trap-and-emulate backend,
> > > musl linked executables will still only run on new kernels.
> > >
> > > I took the trap-and-emulate approach in Apex RTOS to avoid opening
> > > this can of worms. It's the only missing link for musl on armv7-m.
> > > Everything else works beautifully.
> >
> > Another consideration is qemu-user: Currently it aborts when
> > encountering an mrc 15 instruction while emulating armv7-m. I guess
> > that would probably also be solved by supporting the syscall.
> >
> > Patrick
>
> ARMv6-M and v8-M.base do not support the MRC instruction at all. Could
> that play into why Linux and qemu bail?
>
> Jesse

Sorry, I missed this reply.

qemu-user refuses to translate the instruction because cp15 is not
implemented on armv7-m, exactly the same issue as is being discussed
here. If you run the same executable but tell qemu to emulate an A
profile core instead it happily runs it.

Linux will probably kill the executable with SIGILL or something like
that (I haven't tried, just guessing).

It's related to this discussion as changing musl to use the syscall
will likely result in qemu-user working too.

I would personally prefer to see a solution which doesn't use the
syscall. It's possible to implement the trap-and-emulate much more
efficiently than the syscall as it can quite easily be done without
preserving any more registers than the core pushes on exception entry
anyway. https://github.com/apexrtos/apex/blob/master/sys/arch/arm/v7m/emulate.S
is what I came up with. That implementation could be even tighter as
it can never run from handler mode, so the stack detection at the
beginning is unnecessary. However, I haven't considered v6-m or v8-m.

trap-and-emulate also gracefully degrades when running the same
executable on A vs M cores.

Patrick

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2020-12-22 21:43         ` Patrick Oppenlander
@ 2021-01-06  3:24           ` Jesse DeGuire
  2021-01-06  4:01             ` Patrick Oppenlander
  0 siblings, 1 reply; 17+ messages in thread
From: Jesse DeGuire @ 2021-01-06  3:24 UTC (permalink / raw)
  To: Patrick Oppenlander; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 11397 bytes --]

Hi everyone,

Here's attempt three at Musl on Arm M-profile devices. I think I've
incorporated all of the suggestions in this email thread. Let me know
your thoughts.

More specific responses are below.

On Mon, Dec 21, 2020 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
>
> On Mon, Dec 21, 2020 at 06:58:47PM -0500, Jesse DeGuire wrote:
> > On Fri, Dec 18, 2020 at 12:30 PM Rich Felker <dalias@libc.org> wrote:
> > > If it lacks LDREX and STREX how do you implement atomic? I don't see
> > > where you're adding any alternative, so is v6-m support
> > > non-functional? That rather defeats the purpose of doing anything to
> > > support it...
> >
> > Correct, I haven't yet added an alternative. Arm's answer--and what we
> > generally do in the embedded world--is to disable interrupts using
> > "cpsid", do your thing, then re-enable interrupts with "cpsie". This
> > could be done with a new "__a_cas_v6m" variant that I'd add to
> > atomics.s. This still won't work for Linux because the "cps(ie|id)"
> > instruction is effectively a no-op if it is executed in an
> > unprivileged context (meaning you can't trap and emulate it). You'd be
> > looking at another system call if you really wanted v6-m Linux. That
> > said, this could let Musl work on v6-m in a bare metal or RTOS
> > environment, which I think Musl would be great for, and so I'd still
> > work on adding support for it. Also, not all v6-m devices support a
> > privilege model and run as though everything is privileged.
> > ARMv8-M.base is similar to v6-m with LDREX and STREX and so that could
> > have full support.
>
> I'm not sure what the right answer for this is and whether it makes
> support suitable for upstream or not at this point. We should probably
> investigate further. If LDREX/STREX are trappable we could just use
> them and require the kernel to trap and emulate but that's kinda ugly
> (llsc-type atomics are much harder to emulate correctly than a cas
> primitive).

LDREX/STREX should cause a fault on v6m that can be trapped and
handled. It's unfortunate that CPSID/CPSIE are ignored because
trapping on those seem like they'd be easier for whoever has to handle
them.

My current solution is to use the HWCAP_TLS flag on M-profile devices
to indicate how to handle this. If it's set, then the code will use
STREX, LDREX, and MRC with the assumption that they will be trapped
and emulated. If the flag is cleared, then the code will use the ARM
get_tls syscall instead of MRC and will use interrupt masking only if
the platform (aux{AT_PLATFORM]) indicates a "v6" device.

This does mean that I'm overloading an already overloaded flag, but
I'm not yet sure how else to handle this. I can easily not set the
flag in my bare metal environment and get the behavior I want, but I'm
not sure if this is sufficient for nommu Linux users.

> > > With M profile support, though, AIUI it's possible that you have the
> > > atomics but not the thread pointer. You should not assume that lack of
> > > HWCAP_TLS implies lack of the atomics; rather you just have to assume
> > > the atomics are present, I think, or use some other means of detection
> > > with fallback to interrupt masking (assuming these devices have no
> > > kernel/user distinction that prevents you from masking interrupts).
> > > HWCAP_TLS should probably be probed just so you don't assume the
> > > syscall exists in case a system omits it and does trap-and-emulate for
> > > the instruction instead.
> >
> > I think I'm starting to understand this, which is good because it's
> > looking like my startup code for the micros will need to properly set
> > HWCAP before Musl can be used. I assume I'll need to set that
> > 'aux{"AT_PLATFORM"}' to "v6" or "v7" as well to make this runtime
> > detection work properly. I'll have to figure out if "v6m" and "v7m"
> > are supported values for the platform. I may have more questions in
> > the future as I try actually implementing something.
>
> Yes that sounds right. There are other aux vector entries that have to
> be set correctly too for startup code, particularly AT_PHDR for
> __init_tls to find the program headers (and for dl_iterate_phdr to
> work). On some archs AT_HWCAP and AT_PLATFORM are also needed for
> detection of features. AT_MINSIGSTKSZ is needed if the signal frame
> size is variable and may exceed the default one defined in the macro.
> AT_RANDOM is desirable for hardening but not mandatory. AT_EXECFN is
> used as a fallback for program_invocation_name if auxv[0] is missing.
> AT_SYSINFO_EHDR is used to offer vdso but is optional. And AT_*ID and
> AT_SECURE are used to control behavior under suid (not trust
> environment, etc.).

Good to know, thanks.

On Tue, Dec 22, 2020 at 4:44 PM Patrick Oppenlander
<patrick.oppenlander@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:17 PM Jesse DeGuire <jesse.a.deguire@gmail.com> wrote:
> >
> > On Thu, Dec 17, 2020 at 12:10 AM Patrick Oppenlander
> > <patrick.oppenlander@gmail.com> wrote:
> > >
> > > On Thu, Dec 17, 2020 at 3:55 PM Patrick Oppenlander
> > > <patrick.oppenlander@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 17, 2020 at 11:24 AM Rich Felker <dalias@libc.org> wrote:
> > > > >
> > > > > On Wed, Dec 16, 2020 at 06:43:15PM -0500, Jesse DeGuire wrote:
> > > > > > Hey everyone,
> > > > > >
> > > > > > I'm working on putting together a Clang-based toolchain to use with
> > > > > > Microchip PIC32 (MIPS32) and SAM (Cortex-M) microcontrollers as an
> > > > > > alternative to their paid XC32 toolchain and I'd like to use Musl as
> > > > > > the C library. Currently, I'm trying to get it to build for a few
> > > > > > different Cortex-M devices and have found that Musl builds fine for
> > > > > > ARMv7-M, but not for ARMv6-M or v8-M Baseline because Musl uses
> > > > > > instructions not supported on the Thumb ISA subset used by v6-M and
> > > > > > v8-M Baseline devices. I'm using the v1.2.1 tag version of Musl, but
> > > > > > can easily switch to HEAD if needed. I am using a Python script to
> > > > > > build Musl (and eventually the rest of the toolchain), which you can
> > > > > > see on GitHub at the following link. It's a bit of a mess at the
> > > > > > moment, but the build_musl() function is what I'm currently using to
> > > > > > build Musl.
> > > > >
> > > > > I had assumed the thumb1[-ish?] subset wouldn't be interesting, but if
> > > > > it is, let's have a look.
> > > > >
> > > > > > https://github.com/jdeguire/buildPic32Clang/blob/master/buildPic32Clang.py
> > > > > >
> > > > > > Anyway, I have managed to get Musl to build for v6-M, v7-M, and v8-M
> > > > > > Baseline and have attached a diff to this email. If you like, I can go
> > > > > > into more detail as to why I made the changes I made; however, many
> > > > > > changes were merely the result of my attempts to correct errors
> > > > > > reported by Clang due to it encountering instruction sequences not
> > > > > > supported on ARMv6-M.
> > > > >
> > > > > Are there places where clang's linker is failing to make substitutions
> > > > > that the GNU one can do, that would make this simpler? For example I
> > > > > know the GNU one can replace bx rn by mov pc,rn if needed (based on a
> > > > > relocation the assembler emits on the insn).
> > > > >
> > > > > > A number of errors were simply the result of
> > > > > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > > > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > > > > files I had to change from a "lower case s" to a "capital-S" file so
> > > > > > that I could use macros to check for either the Thumb1 ISA
> > > > > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > > > > ("!__ARM_ARCH_ISA_ARM").
> > > > >
> > > > > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > > > > If not this may need an alternate detection. But I'd like to drop as
> > > > > much as possible and just make the code compatible rather than having
> > > > > 2 versions of it. I don't think there are any places where the
> > > > > performance or size is at all relevant.
> > > > >
> > > > > > The changes under
> > > > > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > > > > those because I don't believe Cortex-M devices could handle what was
> > > > > > there (no M-Profile device has Coprocessor 15, for example) and so I
> > > > >
> > > > > Unless this is an ISA level that can't be executed on a normal (non-M)
> > > > > ARM profile, it still needs all the backends that might be needed and
> > > > > runtime selection of which to use. This is okay. I believe Linux for
> > > > > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > > > > needs to be added as a backend. The right way to do this would have
> > > > > been with trap-and-emulate (of cp15) I think...
> > > >
> > > > Linux emulates mrc 15 on old -A cores but they decided not to on -M
> > > > for some reason. BTW, the syscall is called get_tls.
> > > >
> > > > Is there any option other than supporting the get_tls syscall? Even if
> > > > someone puts in the effort to wire up the trap-and-emulate backend,
> > > > musl linked executables will still only run on new kernels.
> > > >
> > > > I took the trap-and-emulate approach in Apex RTOS to avoid opening
> > > > this can of worms. It's the only missing link for musl on armv7-m.
> > > > Everything else works beautifully.
> > >
> > > Another consideration is qemu-user: Currently it aborts when
> > > encountering an mrc 15 instruction while emulating armv7-m. I guess
> > > that would probably also be solved by supporting the syscall.
> > >
> > > Patrick
> >
> > ARMv6-M and v8-M.base do not support the MRC instruction at all. Could
> > that play into why Linux and qemu bail?
> >
> > Jesse
>
> Sorry, I missed this reply.
>
> qemu-user refuses to translate the instruction because cp15 is not
> implemented on armv7-m, exactly the same issue as is being discussed
> here. If you run the same executable but tell qemu to emulate an A
> profile core instead it happily runs it.
>
> Linux will probably kill the executable with SIGILL or something like
> that (I haven't tried, just guessing).
>
> It's related to this discussion as changing musl to use the syscall
> will likely result in qemu-user working too.
>
> I would personally prefer to see a solution which doesn't use the
> syscall. It's possible to implement the trap-and-emulate much more
> efficiently than the syscall as it can quite easily be done without
> preserving any more registers than the core pushes on exception entry
> anyway. https://github.com/apexrtos/apex/blob/master/sys/arch/arm/v7m/emulate.S
> is what I came up with. That implementation could be even tighter as
> it can never run from handler mode, so the stack detection at the
> beginning is unnecessary. However, I haven't considered v6-m or v8-m.
>
> trap-and-emulate also gracefully degrades when running the same
> executable on A vs M cores.
>
> Patrick

Any thoughts on what's shown in this patch? For your RTOS and v7m/v8m,
I'm thinking you'd be able to get the behavior you want by setting the
HWCAP_TLS flag early in your startup code. For my purposes, I plan to
use the syscall because I intend to eventually make a "baremetal" arch
in Musl that turns syscalls into simple function calls. Therefore, I'd
clear the flag in my startup code.

-Jesse

[-- Attachment #2: musl_cortexm_v3.diff --]
[-- Type: application/octet-stream, Size: 14402 bytes --]

diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
index 9e3937cc..54b743bb 100644
--- a/arch/arm/atomic_arch.h
+++ b/arch/arm/atomic_arch.h
@@ -27,16 +27,6 @@ static inline int a_sc(volatile int *p, int v)
 	return !r;
 }
 
-#if __ARM_ARCH_7A__ || __ARM_ARCH_7R__ ||  __ARM_ARCH >= 7
-
-#define a_barrier a_barrier
-static inline void a_barrier()
-{
-	__asm__ __volatile__ ("dmb ish" : : : "memory");
-}
-
-#endif
-
 #define a_pre_llsc a_barrier
 #define a_post_llsc a_barrier
 
@@ -62,13 +52,22 @@ static inline int a_cas(volatile int *p, int t, int s)
 
 #endif
 
-#ifndef a_barrier
 #define a_barrier a_barrier
+#if __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7 || __ARM_ARCH_PROFILE == 'M'
+
+static inline void a_barrier()
+{
+	__asm__ __volatile__ ("dmb ish" : : : "memory");
+}
+
+#else
+
 static inline void a_barrier()
 {
 	register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
 	__asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
 }
+
 #endif
 
 #define a_crash a_crash
diff --git a/arch/arm/crt_arch.h b/arch/arm/crt_arch.h
index 99508b1d..66080422 100644
--- a/arch/arm/crt_arch.h
+++ b/arch/arm/crt_arch.h
@@ -3,13 +3,15 @@ __asm__(
 ".global " START " \n"
 ".type " START ",%function \n"
 START ": \n"
-"	mov fp, #0 \n"
-"	mov lr, #0 \n"
+"	movs a3, #0 \n"
+"	mov fp, a3 \n"
+"	mov lr, a3 \n"
 "	ldr a2, 1f \n"
 "	add a2, pc, a2 \n"
 "	mov a1, sp \n"
-"2:	and ip, a1, #-16 \n"
-"	mov sp, ip \n"
+"2:	subs a3, #16 \n"
+"	ands a1, a3 \n"
+"	mov sp, a1 \n"
 "	bl " START "_c \n"
 ".weak _DYNAMIC \n"
 ".hidden _DYNAMIC \n"
diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
index e689ea21..9155b9a4 100644
--- a/arch/arm/pthread_arch.h
+++ b/arch/arm/pthread_arch.h
@@ -1,5 +1,5 @@
 #if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || (__ARM_ARCH >= 7 && __ARM_ARCH_PROFILE != 'M')
 
 static inline pthread_t __pthread_self()
 {
diff --git a/crt/arm/crtn.s b/crt/arm/crtn.s
index dc020f92..547e64b7 100644
--- a/crt/arm/crtn.s
+++ b/crt/arm/crtn.s
@@ -1,9 +1,9 @@
 .syntax unified
 
 .section .init
-	pop {r0,lr}
-	bx lr
+	pop {r0,r1}
+	bx r1
 
 .section .fini
-	pop {r0,lr}
-	bx lr
+	pop {r0,r1}
+	bx r1
diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
index 3ae133c9..33216200 100644
--- a/src/ldso/arm/tlsdesc.S
+++ b/src/ldso/arm/tlsdesc.S
@@ -12,13 +12,13 @@ __tlsdesc_static:
 .hidden __tlsdesc_dynamic
 .type __tlsdesc_dynamic,%function
 __tlsdesc_dynamic:
-	push {r2,r3,ip,lr}
+	push {r2,r3,r4,lr}
 	ldr r1,[r0]
 	ldr r2,[r1,#4]  // r2 = offset
 	ldr r1,[r1]     // r1 = modid
 
 #if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || (__ARM_ARCH >= 7 && __ARM_ARCH_PROFILE != 'M')
 	mrc p15,0,r0,c13,c0,3
 #else
 	ldr r0,1f
@@ -36,19 +36,28 @@ __tlsdesc_dynamic:
 	bx r0
 #endif
 #endif
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	ldr r3,[r0,#-4] // r3 = dtv
-	ldr ip,[r3,r1,LSL #2]
-	sub r0,ip,r0
+	ldr r4,[r3,r1,LSL #2]
+	sub r0,r4,r0
+#else
+	mov r4,r0
+	subs r4,#4
+	ldr r3,[r4]
+	lsls r4,r1,#2
+	ldr r4,[r3,r4]
+	subs r0,r4,r0
+#endif
 	add r0,r0,r2    // r0 = r3[r1]-r0+r2
 #if __ARM_ARCH >= 5
-	pop {r2,r3,ip,pc}
+	pop {r2,r3,r4,pc}
 #else
-	pop {r2,r3,ip,lr}
+	pop {r2,r3,r4,lr}
 	bx lr
 #endif
 
 #if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
- || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
+ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || (__ARM_ARCH >= 7 && __ARM_ARCH_PROFILE != 'M')
 #else
 	.align 2
 1:	.word __a_gettp_ptr - 2b
diff --git a/src/process/arm/vfork.s b/src/process/arm/vfork.s
index d7ec41b3..b6f0260e 100644
--- a/src/process/arm/vfork.s
+++ b/src/process/arm/vfork.s
@@ -3,7 +3,7 @@
 .type vfork,%function
 vfork:
 	mov ip, r7
-	mov r7, 190
+	movs r7, 190
 	svc 0
 	mov r7, ip
 	.hidden __syscall_ret
diff --git a/src/setjmp/arm/longjmp.S b/src/setjmp/arm/longjmp.S
index 8df0b819..a2641b92 100644
--- a/src/setjmp/arm/longjmp.S
+++ b/src/setjmp/arm/longjmp.S
@@ -7,16 +7,32 @@ _longjmp:
 longjmp:
 	mov ip,r0
 	movs r0,r1
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	moveq r0,#1
 	ldmia ip!, {v1,v2,v3,v4,v5,v6,sl,fp}
 	ldmia ip!, {r2,lr}
 	mov sp,r2
-
+#else
+	bne 4f
+	movs r0,#1
+4:	mov r1,ip
+	adds r1,#16
+	ldmia r1!, {r2-r7}
+	mov lr,r7
+	mov sp,r6
+	mov r11,r5
+	mov r10,r4
+	mov r9,r3
+	mov r8,r2
+	mov ip,r1
+	subs r1,#40
+	ldmia r1!, {r4-r7}
+#endif
 	adr r1,1f
 	ldr r2,1f
 	ldr r1,[r1,r2]
 
-#if __ARM_ARCH < 8
+#if __ARM_ARCH_PROFILE != 'M' && __ARM_ARCH < 8
 	tst r1,#0x260
 	beq 3f
 	// HWCAP_ARM_FPA
@@ -24,14 +40,15 @@ longjmp:
 	beq 2f
 	ldc p2, cr4, [ip], #48
 #endif
-2:	tst r1,#0x40
+2:	movs r2,#0x40
+	tst r1,r2
 	beq 2f
 	.fpu vfp
 	vldmia ip!, {d8-d15}
 	.fpu softvfp
 	.eabi_attribute 10, 0
 	.eabi_attribute 27, 0
-#if __ARM_ARCH < 8
+#if __ARM_ARCH_PROFILE != 'M' && __ARM_ARCH < 8
 	// HWCAP_ARM_IWMMXT
 2:	tst r1,#0x200
 	beq 3f
diff --git a/src/setjmp/arm/setjmp.S b/src/setjmp/arm/setjmp.S
index 45731d22..7ca51886 100644
--- a/src/setjmp/arm/setjmp.S
+++ b/src/setjmp/arm/setjmp.S
@@ -8,17 +8,28 @@
 __setjmp:
 _setjmp:
 setjmp:
+#if defined(__thumb2__)  ||  !defined(__thumb__)
 	mov ip,r0
 	stmia ip!,{v1,v2,v3,v4,v5,v6,sl,fp}
 	mov r2,sp
 	stmia ip!,{r2,lr}
-	mov r0,#0
-
+#else
+	stmia r0!,{r4-r7}
+	mov r1,r8
+	mov r2,r9
+	mov r3,r10
+	mov r4,r11
+	mov r5,sp
+	mov r6,lr
+	stmia r0!,{r1-r6}
+	mov ip,r0
+#endif
+	movs r0,#0
 	adr r1,1f
 	ldr r2,1f
 	ldr r1,[r1,r2]
 
-#if __ARM_ARCH < 8
+#if __ARM_ARCH_PROFILE != 'M' && __ARM_ARCH < 8
 	tst r1,#0x260
 	beq 3f
 	// HWCAP_ARM_FPA
@@ -26,14 +37,15 @@ setjmp:
 	beq 2f
 	stc p2, cr4, [ip], #48
 #endif
-2:	tst r1,#0x40
+2:	movs r2,#0x40
+	tst r1,r2
 	beq 2f
 	.fpu vfp
 	vstmia ip!, {d8-d15}
 	.fpu softvfp
 	.eabi_attribute 10, 0
 	.eabi_attribute 27, 0
-#if __ARM_ARCH < 8
+#if __ARM_ARCH_PROFILE != 'M' && __ARM_ARCH < 8
 	// HWCAP_ARM_IWMMXT
 2:	tst r1,#0x200
 	beq 3f
diff --git a/src/signal/arm/restore.s b/src/signal/arm/restore.s
index fb086d9b..2b7621b1 100644
--- a/src/signal/arm/restore.s
+++ b/src/signal/arm/restore.s
@@ -4,12 +4,12 @@
 .hidden __restore
 .type __restore,%function
 __restore:
-	mov r7,#119
+	movs r7,#119
 	swi 0x0
 
 .global __restore_rt
 .hidden __restore_rt
 .type __restore_rt,%function
 __restore_rt:
-	mov r7,#173
+	movs r7,#173
 	swi 0x0
diff --git a/src/signal/arm/sigsetjmp.s b/src/signal/arm/sigsetjmp.s
index 69ebbf49..8ef51de3 100644
--- a/src/signal/arm/sigsetjmp.s
+++ b/src/signal/arm/sigsetjmp.s
@@ -9,16 +9,20 @@ __sigsetjmp:
 	bne 1f
 	b setjmp
 
-1:	str lr,[r0,#256]
-	str r4,[r0,#260+8]
+1:	mov r2,lr
+	adds r0,#200
+	str r2,[r0,#56]
+	str r4,[r0,#60+8]
 	mov r4,r0
 
 	bl setjmp
 
 	mov r1,r0
 	mov r0,r4
-	ldr lr,[r0,#256]
-	ldr r4,[r0,#260+8]
+	ldr r2,[r0,#56]
+	mov lr,r2
+	ldr r4,[r0,#60+8]
+	subs r0,#200
 
 .hidden __sigsetjmp_tail
 	b __sigsetjmp_tail
diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
index 869e3448..2eb28eec 100644
--- a/src/string/arm/memcpy.S
+++ b/src/string/arm/memcpy.S
@@ -43,6 +43,8 @@
  * building as thumb 2 and big-endian.
  */
 
+#if defined(__thumb2__)  ||  !defined(__thumb__)
+
 .syntax unified
 
 .global memcpy
@@ -477,3 +479,4 @@ copy_last_3_and_return:
 	ldmfd   sp!, {r0, r4, lr}
 	bx      lr
 
+#endif /* defined(__thumb2__)  ||  !defined(__thumb__) */
diff --git a/src/string/arm/memcpy_thumb1.c b/src/string/arm/memcpy_thumb1.c
new file mode 100755
index 00000000..23571e00
--- /dev/null
+++ b/src/string/arm/memcpy_thumb1.c
@@ -0,0 +1,5 @@
+#if defined(__thumb__)  &&  !defined(__thumb2__)
+
+#include "../memcpy.c"
+
+#endif
\ No newline at end of file
diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
index 09de65aa..99ce5f41 100644
--- a/src/thread/arm/__set_thread_area.c
+++ b/src/thread/arm/__set_thread_area.c
@@ -6,27 +6,50 @@
 #define HWCAP_TLS (1 << 15)
 
 extern hidden const unsigned char
-	__a_barrier_oldkuser[], __a_barrier_v6[], __a_barrier_v7[],
-	__a_cas_v6[], __a_cas_v7[],
-	__a_gettp_cp15[];
+	__a_barrier_oldkuser[], __a_barrier_v6[], __a_barrier_v7[], __a_barrier_m[],
+	__a_cas_v6[], __a_cas_v7[], __a_cas_m[], __a_cas_intmask_m[],
+	__a_gettp_cp15[], __a_gettp_cp15_m[], __a_gettp_syscall_m[];
 
 #define __a_barrier_kuser 0xffff0fa0
 #define __a_barrier_oldkuser (uintptr_t)__a_barrier_oldkuser
 #define __a_barrier_v6 (uintptr_t)__a_barrier_v6
 #define __a_barrier_v7 (uintptr_t)__a_barrier_v7
+#define __a_barrier_m (uintptr_t)__a_barrier_m
 
 #define __a_cas_kuser 0xffff0fc0
 #define __a_cas_v6 (uintptr_t)__a_cas_v6
 #define __a_cas_v7 (uintptr_t)__a_cas_v7
+#define __a_cas_m (uintptr_t)__a_cas_m
+#define __a_cas_intmask_m (uintptr_t)__a_cas_intmask_m
 
 #define __a_gettp_kuser 0xffff0fe0
 #define __a_gettp_cp15 (uintptr_t)__a_gettp_cp15
+#define __a_gettp_cp15_m (uintptr_t)__a_gettp_cp15_m
+#define __a_gettp_syscall_m (uintptr_t)__a_gettp_syscall_m
 
 extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
 
 int __set_thread_area(void *p)
 {
-#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
+#if __ARM_ARCH_PROFILE == 'M'
+	__a_cas_ptr = __a_cas_m;
+	__a_barrier_ptr = __a_barrier_m;
+
+	if (__hwcap & HWCAP_TLS) {
+		__a_gettp_ptr = __a_gettp_cp15_m;
+	} else {
+		size_t *aux;
+		__a_gettp_ptr = __a_gettp_syscall_m;
+		for (aux=libc.auxv; *aux; aux+=2) {
+			if (*aux != AT_PLATFORM) continue;
+			const char *s = (void *)aux[1];
+			if (s[0]=='v' && s[1]=='6') {
+				__a_cas_ptr = __a_cas_intmask_m;
+				break;
+			}
+		}
+	}
+#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
 	if (__hwcap & HWCAP_TLS) {
 		size_t *aux;
 		__a_cas_ptr = __a_cas_v7;
diff --git a/src/thread/arm/__unmapself.c b/src/thread/arm/__unmapself.c
new file mode 100755
index 00000000..9afc4780
--- /dev/null
+++ b/src/thread/arm/__unmapself.c
@@ -0,0 +1,21 @@
+#if __ARM_ARCH_PROFILE != 'M'
+
+#include "pthread_impl.h"
+
+void __unmapself(void *base, size_t size)
+{
+	register void *r0 __asm__("r0") = base;
+	register size_t r1 __asm__("r1") = size;
+	__asm__ __volatile__ (
+	"	movs r7,#91 \n"
+	"	svc 0 \n"
+	"	movs r7,#1 \n"
+	"	svc 0 \n"
+	:: "r"(r0), "r"(r1));
+}
+
+#else
+
+#include "../__unmapself.c"
+
+#endif
diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
deleted file mode 100644
index 29c2d07b..00000000
--- a/src/thread/arm/__unmapself.s
+++ /dev/null
@@ -1,9 +0,0 @@
-.syntax unified
-.text
-.global __unmapself
-.type   __unmapself,%function
-__unmapself:
-	mov r7,#91
-	svc 0
-	mov r7,#1
-	svc 0
diff --git a/src/thread/arm/atomics.s b/src/thread/arm/atomics.s
index da50508d..900572dc 100644
--- a/src/thread/arm/atomics.s
+++ b/src/thread/arm/atomics.s
@@ -11,6 +11,8 @@ __a_barrier_dummy:
 .hidden __a_barrier_oldkuser
 .type __a_barrier_oldkuser,%function
 __a_barrier_oldkuser:
+	.arch armv6
+	.arm
 	push {r0,r1,r2,r3,ip,lr}
 	mov r1,r0
 	mov r2,sp
@@ -25,6 +27,7 @@ __a_barrier_oldkuser:
 .type __a_barrier_v6,%function
 __a_barrier_v6:
 	.arch armv6t2
+	.arm
 	mcr p15,0,r0,c7,c10,5
 	bx lr
 
@@ -33,24 +36,38 @@ __a_barrier_v6:
 .type __a_barrier_v7,%function
 __a_barrier_v7:
 	.arch armv7-a
+	.arm
 	dmb ish
 	bx lr
 
+.global __a_barrier_m
+.hidden __a_barrier_m
+.type __a_barrier_m,%function
+__a_barrier_m:
+	.arch armv6-m
+	.thumb
+	dmb
+	bx lr
+
 .global __a_cas_dummy
 .hidden __a_cas_dummy
 .type __a_cas_dummy,%function
 __a_cas_dummy:
+	.arch armv7-a
+	.arm
 	mov r3,r0
 	ldr r0,[r2]
 	subs r0,r3,r0
-	streq r1,[r2]
-	bx lr
+	bne 1f
+	str r1,[r2]
+1:	bx lr
 
 .global __a_cas_v6
 .hidden __a_cas_v6
 .type __a_cas_v6,%function
 __a_cas_v6:
 	.arch armv6t2
+	.arm
 	mov r3,r0
 	mcr p15,0,r0,c7,c10,5
 1:	ldrex r0,[r2]
@@ -66,6 +83,7 @@ __a_cas_v6:
 .type __a_cas_v7,%function
 __a_cas_v7:
 	.arch armv7-a
+	.arm
 	mov r3,r0
 	dmb ish
 1:	ldrex r0,[r2]
@@ -76,13 +94,72 @@ __a_cas_v7:
 	dmb ish
 	bx lr
 
+.global __a_cas_m
+.hidden __a_cas_m
+.type __a_cas_m,%function
+__a_cas_m:
+	.arch armv7-m
+	.thumb
+	mov r3,r0
+	dmb
+1:	ldrex r0,[r2]
+	subs r0,r3,r0
+	bne 1b
+	strex r0,r1,[r2]
+	tst r0,r0
+	bne 1b
+	dmb
+	bx lr
+
+.global __a_cas_intmask_m
+.hidden __a_cas_intmask_m
+.type __a_cas_intmask_m,%function
+__a_cas_intmask_m:
+	.arch armv6-m
+	.thumb
+	mov r3,r0
+	dmb
+	cpsid i
+1:	ldr r0,[r2]
+	subs r0,r3,r0
+	bne 1b
+	str r1,[r2]
+	cpsie i
+	dmb
+	bx lr
+
 .global __a_gettp_cp15
 .hidden __a_gettp_cp15
 .type __a_gettp_cp15,%function
 __a_gettp_cp15:
+	.arch armv6
+	.arm
 	mrc p15,0,r0,c13,c0,3
 	bx lr
 
+.global __a_gettp_cp15_m
+.hidden __a_gettp_cp15_m
+.type __a_gettp_cp15_m,%function
+__a_gettp_cp15_m:
+	.arch armv7-m
+	.thumb
+	mrc p15,0,r0,c13,c0,3
+	bx lr
+
+.global __a_gettp_syscall_m
+.hidden __a_gettp_syscall_m
+.type __a_gettp_syscall_m,%function
+__a_gettp_syscall_m:
+	.arch armv6-m
+	.thumb
+	push {r7}
+	movs r7,#0xf
+	lsls r7,r7,#16
+	adds r7,#6          /* ARM get_tls syscall (0xf0006) */
+	svc 0
+	pop {r7}
+	bx lr
+
 /* Tag this file with minimum ISA level so as not to affect linking. */
 .object_arch armv4t
 .eabi_attribute 6,2
diff --git a/src/thread/arm/clone.s b/src/thread/arm/clone.s
index bb0965da..33c2e59b 100644
--- a/src/thread/arm/clone.s
+++ b/src/thread/arm/clone.s
@@ -4,24 +4,26 @@
 .hidden __clone
 .type   __clone,%function
 __clone:
-	stmfd sp!,{r4,r5,r6,r7}
-	mov r7,#120
+	push {r4,r5,r6,r7}
+	movs r7,#120
 	mov r6,r3
 	mov r5,r0
 	mov r0,r2
-	and r1,r1,#-16
+	movs r2,#0
+	subs r2,#16
+	ands r1,r2
 	ldr r2,[sp,#16]
 	ldr r3,[sp,#20]
 	ldr r4,[sp,#24]
 	svc 0
 	tst r0,r0
 	beq 1f
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	bx lr
 
 1:	mov r0,r6
 	bl 3f
-2:	mov r7,#1
+2:	movs r7,#1
 	svc 0
 	b 2b
 
diff --git a/src/thread/arm/syscall_cp.s b/src/thread/arm/syscall_cp.s
index e607dd42..421e64f4 100644
--- a/src/thread/arm/syscall_cp.s
+++ b/src/thread/arm/syscall_cp.s
@@ -11,7 +11,7 @@
 .type __syscall_cp_asm,%function
 __syscall_cp_asm:
 	mov ip,sp
-	stmfd sp!,{r4,r5,r6,r7}
+	push {r4,r5,r6,r7}
 __cp_begin:
 	ldr r0,[r0]
 	cmp r0,#0
@@ -19,11 +19,12 @@ __cp_begin:
 	mov r7,r1
 	mov r0,r2
 	mov r1,r3
-	ldmfd ip,{r2,r3,r4,r5,r6}
+	mov r2,ip
+	ldmfd r2,{r2,r3,r4,r5,r6}
 	svc 0
 __cp_end:
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	bx lr
 __cp_cancel:
-	ldmfd sp!,{r4,r5,r6,r7}
+	pop {r4,r5,r6,r7}
 	b __cancel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2021-01-06  3:24           ` Jesse DeGuire
@ 2021-01-06  4:01             ` Patrick Oppenlander
  2021-01-13 23:51               ` Jesse DeGuire
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Oppenlander @ 2021-01-06  4:01 UTC (permalink / raw)
  To: Jesse DeGuire; +Cc: musl

On Wed, Jan 6, 2021 at 2:25 PM Jesse DeGuire <jesse.a.deguire@gmail.com> wrote:
> >
> > Sorry, I missed this reply.
> >
> > qemu-user refuses to translate the instruction because cp15 is not
> > implemented on armv7-m, exactly the same issue as is being discussed
> > here. If you run the same executable but tell qemu to emulate an A
> > profile core instead it happily runs it.
> >
> > Linux will probably kill the executable with SIGILL or something like
> > that (I haven't tried, just guessing).
> >
> > It's related to this discussion as changing musl to use the syscall
> > will likely result in qemu-user working too.
> >
> > I would personally prefer to see a solution which doesn't use the
> > syscall. It's possible to implement the trap-and-emulate much more
> > efficiently than the syscall as it can quite easily be done without
> > preserving any more registers than the core pushes on exception entry
> > anyway. https://github.com/apexrtos/apex/blob/master/sys/arch/arm/v7m/emulate.S
> > is what I came up with. That implementation could be even tighter as
> > it can never run from handler mode, so the stack detection at the
> > beginning is unnecessary. However, I haven't considered v6-m or v8-m.
> >
> > trap-and-emulate also gracefully degrades when running the same
> > executable on A vs M cores.
> >
> > Patrick
>
> Any thoughts on what's shown in this patch? For your RTOS and v7m/v8m,
> I'm thinking you'd be able to get the behavior you want by setting the
> HWCAP_TLS flag early in your startup code. For my purposes, I plan to
> use the syscall because I intend to eventually make a "baremetal" arch
> in Musl that turns syscalls into simple function calls. Therefore, I'd
> clear the flag in my startup code.

What you've proposed (using HWCAP_TLS) sounds good to me. Apex tries
very hard to look like Linux to userspace, so anything that works for
Linux will work for it. As an aside, I also use a forked and reduced
musl as the kernel libc - a baremetal target could be handy there.

It might be nice to have a configure option to disable the syscall
fallback completely (compile time constant test of some kind?) to
allow inlining of the mrc instruction. If such an executable ever
tried to run on an unsupported platform it would be killed with an
invalid instruction exception.

If it's acceptable to have that type of configure-time tuning we can
also disable mutex & rwlock spins (set spins to 0) as an optimisation
for uniprocessor targets.

Patrick

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [musl] Musl on Cortex-M Devices
  2021-01-06  4:01             ` Patrick Oppenlander
@ 2021-01-13 23:51               ` Jesse DeGuire
  0 siblings, 0 replies; 17+ messages in thread
From: Jesse DeGuire @ 2021-01-13 23:51 UTC (permalink / raw)
  To: musl; +Cc: Patrick Oppenlander

[-- Attachment #1: Type: text/plain, Size: 3759 bytes --]

This version fixes a few warnings, fixes some CRLF line endings, and
changes the syscall variant of "__a_gettp" to a C function that uses
the __syscall() macro instead of hardcoded assembly. That last change
was done to better take advantage of what Musl already provides and
because it will let me more easily handle the thread pointer should I
make a "baremetal" target.

On another note, I notice that other thread subjects are tagged with
things like "[musl]" or "[patch]". Am I supposed to add those or is
OpenWall supposed to do that? The former seems rather redundant to me,
but the latter does make sense.

On Tue, Jan 5, 2021 at 11:01 PM Patrick Oppenlander
<patrick.oppenlander@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 2:25 PM Jesse DeGuire <jesse.a.deguire@gmail.com> wrote:
> > >
> > > Sorry, I missed this reply.
> > >
> > > qemu-user refuses to translate the instruction because cp15 is not
> > > implemented on armv7-m, exactly the same issue as is being discussed
> > > here. If you run the same executable but tell qemu to emulate an A
> > > profile core instead it happily runs it.
> > >
> > > Linux will probably kill the executable with SIGILL or something like
> > > that (I haven't tried, just guessing).
> > >
> > > It's related to this discussion as changing musl to use the syscall
> > > will likely result in qemu-user working too.
> > >
> > > I would personally prefer to see a solution which doesn't use the
> > > syscall. It's possible to implement the trap-and-emulate much more
> > > efficiently than the syscall as it can quite easily be done without
> > > preserving any more registers than the core pushes on exception entry
> > > anyway. https://github.com/apexrtos/apex/blob/master/sys/arch/arm/v7m/emulate.S
> > > is what I came up with. That implementation could be even tighter as
> > > it can never run from handler mode, so the stack detection at the
> > > beginning is unnecessary. However, I haven't considered v6-m or v8-m.
> > >
> > > trap-and-emulate also gracefully degrades when running the same
> > > executable on A vs M cores.
> > >
> > > Patrick
> >
> > Any thoughts on what's shown in this patch? For your RTOS and v7m/v8m,
> > I'm thinking you'd be able to get the behavior you want by setting the
> > HWCAP_TLS flag early in your startup code. For my purposes, I plan to
> > use the syscall because I intend to eventually make a "baremetal" arch
> > in Musl that turns syscalls into simple function calls. Therefore, I'd
> > clear the flag in my startup code.
>
> What you've proposed (using HWCAP_TLS) sounds good to me. Apex tries
> very hard to look like Linux to userspace, so anything that works for
> Linux will work for it. As an aside, I also use a forked and reduced
> musl as the kernel libc - a baremetal target could be handy there.
>
> It might be nice to have a configure option to disable the syscall
> fallback completely (compile time constant test of some kind?) to
> allow inlining of the mrc instruction. If such an executable ever
> tried to run on an unsupported platform it would be killed with an
> invalid instruction exception.
>
> If it's acceptable to have that type of configure-time tuning we can
> also disable mutex & rwlock spins (set spins to 0) as an optimisation
> for uniprocessor targets.
>
> Patrick

I still need to figure out how I want to really handle a baremetal
target, though it does seem like Musl makes adding targets relatively
easy to add. I also don't mind adding more compile-time configuration
options, but I'll leave whether I should or not up to others. I
noticed that Musl doesn't really have those (except when needed to
figure out architecture variants), so I'm not sure if avoiding such
checks is part of an overall policy for Musl.

-Jesse

[-- Attachment #2: musl_cortexm_v4.patch --]
[-- Type: application/x-patch, Size: 13959 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-01-13 23:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-16 23:43 [musl] Musl on Cortex-M Devices Jesse DeGuire
2020-12-17  0:23 ` Rich Felker
2020-12-17  4:55   ` Patrick Oppenlander
2020-12-17  5:10     ` Patrick Oppenlander
2020-12-18  7:17       ` Jesse DeGuire
2020-12-22 21:43         ` Patrick Oppenlander
2021-01-06  3:24           ` Jesse DeGuire
2021-01-06  4:01             ` Patrick Oppenlander
2021-01-13 23:51               ` Jesse DeGuire
2020-12-18  7:15   ` Jesse DeGuire
2020-12-18 17:30     ` Rich Felker
2020-12-21 23:58       ` Jesse DeGuire
2020-12-22  1:39         ` Rich Felker
2020-12-18 19:38   ` Markus Wichmann
2020-12-18 20:34     ` Rich Felker
2020-12-22  0:00       ` Jesse DeGuire
2020-12-22  1:40         ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).