* [musl] [RFC 00/14] aarch64: Convert to inline asm
@ 2025-12-08 17:44 Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 01/14] aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI Bill Roberts
` (14 more replies)
0 siblings, 15 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Based on previous discussions about enabling PAC and BTI for AArch64
targets, this series rewrites the existing handwritten assembly as C
with inline assembly, rather than annotating the .s files. This has
the following benefits:
1. Handling PAC, BTI and GCS:
   a. prologue and epilogue insertion as needed.
   b. adding GNU notes as needed.
2. Adding CFI statements as needed.
I'd love to get feedback, thanks!
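As a hedged aside (flags assumed, not part of this series): once the
routines are C, the PAC/BTI prologues and the GNU property notes come
from the compiler's branch-protection mode, e.g. a config.mak fragment
along the lines of:

```make
# Assumed build flag, not from this series: ask GCC/Clang to emit
# paciasp/autiasp sequences, bti landing pads, and the matching
# .note.gnu.property section for all C objects.
CFLAGS += -mbranch-protection=standard
```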
Bill Roberts (14):
aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
aarch64: rewrite fenv routines in C using inline asm
aarch64: rewrite vfork routine in C using inline asm
aarch64: rewrite clone routine in C using inline asm
aarch64: rewrite __syscall_cp_asm in C using inline asm
aarch64: rewrite __unmapself in C using inline asm
aarch64: rewrite tlsdesc routines in C using inline asm
aarch64: rewrite __restore_rt routines in C using inline asm
aarch64: rewrite longjmp routines in C using inline asm
aarch64: rewrite setjmp routines in C using inline asm
aarch64: rewrite sigsetjmp routines in C using inline asm
aarch64: rewrite dlsym routine in C using inline asm
aarch64: rewrite memcpy routine in C using inline asm
aarch64: rewrite memset routine in C using inline asm
arch/aarch64/crt_arch.h | 29 ++---
crt/aarch64/crti.s | 15 ---
crt/aarch64/crtn.s | 7 --
src/fenv/aarch64/fenv.c | 96 ++++++++++++++++
src/fenv/aarch64/fenv.s | 68 -----------
src/ldso/aarch64/dlsym.c | 11 ++
src/ldso/aarch64/dlsym.s | 6 -
src/ldso/aarch64/tlsdesc.c | 50 +++++++++
src/ldso/aarch64/tlsdesc.s | 31 ------
src/process/aarch64/vfork.c | 21 ++++
src/process/aarch64/vfork.s | 9 --
src/setjmp/aarch64/longjmp.c | 39 +++++++
src/setjmp/aarch64/longjmp.s | 23 ----
src/setjmp/aarch64/setjmp.c | 34 ++++++
src/setjmp/aarch64/setjmp.s | 24 ----
src/signal/aarch64/restore.c | 15 +++
src/signal/aarch64/restore.s | 10 --
src/signal/aarch64/sigsetjmp.c | 43 +++++++
src/signal/aarch64/sigsetjmp.s | 21 ----
src/string/aarch64/memcpy.S | 186 -------------------------------
src/string/aarch64/memcpy.c | 168 ++++++++++++++++++++++++++++
src/string/aarch64/memset.S | 115 -------------------
src/string/aarch64/memset.c | 122 ++++++++++++++++++++
src/thread/aarch64/__unmapself.c | 16 +++
src/thread/aarch64/__unmapself.s | 7 --
src/thread/aarch64/clone.c | 44 ++++++++
src/thread/aarch64/clone.s | 31 ------
src/thread/aarch64/syscall_cp.c | 61 ++++++++++
src/thread/aarch64/syscall_cp.s | 32 ------
29 files changed, 736 insertions(+), 598 deletions(-)
delete mode 100644 crt/aarch64/crti.s
delete mode 100644 crt/aarch64/crtn.s
create mode 100644 src/fenv/aarch64/fenv.c
delete mode 100644 src/fenv/aarch64/fenv.s
create mode 100644 src/ldso/aarch64/dlsym.c
delete mode 100644 src/ldso/aarch64/dlsym.s
create mode 100644 src/ldso/aarch64/tlsdesc.c
delete mode 100644 src/ldso/aarch64/tlsdesc.s
create mode 100644 src/process/aarch64/vfork.c
delete mode 100644 src/process/aarch64/vfork.s
create mode 100644 src/setjmp/aarch64/longjmp.c
delete mode 100644 src/setjmp/aarch64/longjmp.s
create mode 100644 src/setjmp/aarch64/setjmp.c
delete mode 100644 src/setjmp/aarch64/setjmp.s
create mode 100644 src/signal/aarch64/restore.c
delete mode 100644 src/signal/aarch64/restore.s
create mode 100644 src/signal/aarch64/sigsetjmp.c
delete mode 100644 src/signal/aarch64/sigsetjmp.s
delete mode 100644 src/string/aarch64/memcpy.S
create mode 100644 src/string/aarch64/memcpy.c
delete mode 100644 src/string/aarch64/memset.S
create mode 100644 src/string/aarch64/memset.c
create mode 100644 src/thread/aarch64/__unmapself.c
delete mode 100644 src/thread/aarch64/__unmapself.s
create mode 100644 src/thread/aarch64/clone.c
delete mode 100644 src/thread/aarch64/clone.s
create mode 100644 src/thread/aarch64/syscall_cp.c
delete mode 100644 src/thread/aarch64/syscall_cp.s
--
2.51.0
^ permalink raw reply [flat|nested] 30+ messages in thread
* [musl] [RFC 01/14] aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 02/14] aarch64: rewrite fenv routines in C using inline asm Bill Roberts
` (13 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
These stubs define legacy .init/.fini sections that are not used when
NO_LEGACY_INITFINI is set. Initialization and finalization are handled
exclusively via .init_array/.fini_array on these targets.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
crt/aarch64/crti.s | 15 ---------------
crt/aarch64/crtn.s | 7 -------
2 files changed, 22 deletions(-)
delete mode 100644 crt/aarch64/crti.s
delete mode 100644 crt/aarch64/crtn.s
diff --git a/crt/aarch64/crti.s b/crt/aarch64/crti.s
deleted file mode 100644
index 3776fa64..00000000
--- a/crt/aarch64/crti.s
+++ /dev/null
@@ -1,15 +0,0 @@
-.section .init
-.global _init
-.type _init,%function
-.align 2
-_init:
- stp x29,x30,[sp,-16]!
- mov x29,sp
-
-.section .fini
-.global _fini
-.type _fini,%function
-.align 2
-_fini:
- stp x29,x30,[sp,-16]!
- mov x29,sp
diff --git a/crt/aarch64/crtn.s b/crt/aarch64/crtn.s
deleted file mode 100644
index 73cab692..00000000
--- a/crt/aarch64/crtn.s
+++ /dev/null
@@ -1,7 +0,0 @@
-.section .init
- ldp x29,x30,[sp],#16
- ret
-
-.section .fini
- ldp x29,x30,[sp],#16
- ret
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 02/14] aarch64: rewrite fenv routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 01/14] aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 03/14] aarch64: rewrite vfork routine " Bill Roberts
` (12 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 floating-point environment routines (fegetround,
__fesetround, fetestexcept, feclearexcept, feraiseexcept, fegetenv,
fesetenv) from assembly into C implementations using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in fenv.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
The new implementations mirror the original assembly semantics exactly:
- access FPCR/FPSR using `mrs`/`msr` inline asm
- preserve the same FE_* masks and return conventions
- retain `__fesetround` as a hidden internal symbol
- support the special FE_DFL_ENV case in fesetenv
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
arch/aarch64/crt_arch.h | 29 +++++++------
src/fenv/aarch64/fenv.c | 96 +++++++++++++++++++++++++++++++++++++++++
src/fenv/aarch64/fenv.s | 68 -----------------------------
3 files changed, 112 insertions(+), 81 deletions(-)
create mode 100644 src/fenv/aarch64/fenv.c
delete mode 100644 src/fenv/aarch64/fenv.s
diff --git a/arch/aarch64/crt_arch.h b/arch/aarch64/crt_arch.h
index b64fb3dd..cff8edb3 100644
--- a/arch/aarch64/crt_arch.h
+++ b/arch/aarch64/crt_arch.h
@@ -1,15 +1,18 @@
__asm__(
-".text \n"
-".global " START "\n"
-".type " START ",%function\n"
-START ":\n"
-" mov x29, #0\n"
-" mov x30, #0\n"
-" mov x0, sp\n"
-".weak _DYNAMIC\n"
-".hidden _DYNAMIC\n"
-" adrp x1, _DYNAMIC\n"
-" add x1, x1, #:lo12:_DYNAMIC\n"
-" and sp, x0, #-16\n"
-" b " START "_c\n"
+".text \n\t"
+".global " START "\n\t"
+".type " START ",%function\n\t"
+START ":\n\t"
+#if defined(__ARM_FEATURE_BTI_DEFAULT)
+" hint 34\n\t"
+#endif
+" mov x29, #0\n\t"
+" mov x30, #0\n\t"
+" mov x0, sp\n\t"
+".weak _DYNAMIC\n\t"
+".hidden _DYNAMIC\n\t"
+" adrp x1, _DYNAMIC\n\t"
+" add x1, x1, #:lo12:_DYNAMIC\n\t"
+" and sp, x0, #-16\n\t"
+" b " START "_c\n\t"
);
diff --git a/src/fenv/aarch64/fenv.c b/src/fenv/aarch64/fenv.c
new file mode 100644
index 00000000..6d84feac
--- /dev/null
+++ b/src/fenv/aarch64/fenv.c
@@ -0,0 +1,96 @@
+#include <fenv.h>
+#include <stdint.h>
+
+#define FE_RMODE_MASK 0x00C00000u // FPCR RMode bits [23:22]
+#define FE_EXC_MASK 0x0000001Fu // FPSR exception flags [4:0]
+
+static inline uint32_t read_fpcr_u32(void)
+{
+ uint64_t x;
+ __asm__ volatile ("mrs %0, fpcr" : "=r"(x));
+ return (uint32_t)x;
+}
+
+static inline void write_fpcr_u32(uint32_t v)
+{
+ uint64_t x = (uint64_t)v;
+ __asm__ volatile ("msr fpcr, %0" :: "r"(x) : "memory");
+}
+
+static inline uint32_t read_fpsr_u32(void)
+{
+ uint64_t x;
+ __asm__ volatile ("mrs %0, fpsr" : "=r"(x));
+ return (uint32_t)x;
+}
+
+static inline void write_fpsr_u32(uint32_t v)
+{
+ uint64_t x = (uint64_t)v;
+ __asm__ volatile ("msr fpsr, %0" :: "r"(x) : "memory");
+}
+
+int fegetround(void)
+{
+ uint32_t fpcr = read_fpcr_u32();
+ return (int)(fpcr & FE_RMODE_MASK);
+}
+
+__attribute__((__visibility__("hidden")))
+int __fesetround(int rm)
+{
+ uint32_t fpcr = read_fpcr_u32();
+ fpcr &= ~FE_RMODE_MASK;
+ fpcr |= (uint32_t)rm;
+ write_fpcr_u32(fpcr);
+ return 0;
+}
+
+int fetestexcept(int mask)
+{
+ uint32_t m = (uint32_t)mask & FE_EXC_MASK;
+ uint32_t fpsr = read_fpsr_u32();
+ return (int)(m & fpsr);
+}
+
+int feclearexcept(int mask)
+{
+ uint32_t m = (uint32_t)mask & FE_EXC_MASK;
+ uint32_t fpsr = read_fpsr_u32();
+ fpsr &= ~m;
+ write_fpsr_u32(fpsr);
+ return 0;
+}
+
+int feraiseexcept(int mask)
+{
+ uint32_t m = (uint32_t)mask & FE_EXC_MASK;
+ uint32_t fpsr = read_fpsr_u32();
+ fpsr |= m;
+ write_fpsr_u32(fpsr);
+ return 0;
+}
+
+int fegetenv(fenv_t *env)
+{
+ uint32_t fpcr = read_fpcr_u32();
+ uint32_t fpsr = read_fpsr_u32();
+ env->__fpcr = fpcr;
+ env->__fpsr = fpsr;
+ return 0;
+}
+
+int fesetenv(const fenv_t *env)
+{
+ uint32_t fpcr = 0;
+ uint32_t fpsr = 0;
+
+ if (env != FE_DFL_ENV) {
+ fpcr = env->__fpcr;
+ fpsr = env->__fpsr;
+ }
+
+ write_fpcr_u32(fpcr);
+ write_fpsr_u32(fpsr);
+ return 0;
+}
diff --git a/src/fenv/aarch64/fenv.s b/src/fenv/aarch64/fenv.s
deleted file mode 100644
index 8f3ec965..00000000
--- a/src/fenv/aarch64/fenv.s
+++ /dev/null
@@ -1,68 +0,0 @@
-.global fegetround
-.type fegetround,%function
-fegetround:
- mrs x0, fpcr
- and w0, w0, #0xc00000
- ret
-
-.global __fesetround
-.hidden __fesetround
-.type __fesetround,%function
-__fesetround:
- mrs x1, fpcr
- bic w1, w1, #0xc00000
- orr w1, w1, w0
- msr fpcr, x1
- mov w0, #0
- ret
-
-.global fetestexcept
-.type fetestexcept,%function
-fetestexcept:
- and w0, w0, #0x1f
- mrs x1, fpsr
- and w0, w0, w1
- ret
-
-.global feclearexcept
-.type feclearexcept,%function
-feclearexcept:
- and w0, w0, #0x1f
- mrs x1, fpsr
- bic w1, w1, w0
- msr fpsr, x1
- mov w0, #0
- ret
-
-.global feraiseexcept
-.type feraiseexcept,%function
-feraiseexcept:
- and w0, w0, #0x1f
- mrs x1, fpsr
- orr w1, w1, w0
- msr fpsr, x1
- mov w0, #0
- ret
-
-.global fegetenv
-.type fegetenv,%function
-fegetenv:
- mrs x1, fpcr
- mrs x2, fpsr
- stp w1, w2, [x0]
- mov w0, #0
- ret
-
-// TODO preserve some bits
-.global fesetenv
-.type fesetenv,%function
-fesetenv:
- mov x1, #0
- mov x2, #0
- cmn x0, #1
- b.eq 1f
- ldp w1, w2, [x0]
-1: msr fpcr, x1
- msr fpsr, x2
- mov w0, #0
- ret
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 03/14] aarch64: rewrite vfork routine in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 01/14] aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 02/14] aarch64: rewrite fenv routines in C using inline asm Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-11 12:09 ` Florian Weimer
2025-12-08 17:44 ` [musl] [RFC 04/14] aarch64: rewrite clone " Bill Roberts
` (11 subsequent siblings)
14 siblings, 1 reply; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 vfork routine from assembly into a C
implementation using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in vfork.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/process/aarch64/vfork.c | 21 +++++++++++++++++++++
src/process/aarch64/vfork.s | 9 ---------
2 files changed, 21 insertions(+), 9 deletions(-)
create mode 100644 src/process/aarch64/vfork.c
delete mode 100644 src/process/aarch64/vfork.s
diff --git a/src/process/aarch64/vfork.c b/src/process/aarch64/vfork.c
new file mode 100644
index 00000000..87ec8ebf
--- /dev/null
+++ b/src/process/aarch64/vfork.c
@@ -0,0 +1,21 @@
+#include <sys/types.h>
+
+#include "syscall.h"
+
+pid_t vfork(void)
+{
+ /* aarch64 Linux syscall: x8 = nr, x0..x5 = args, ret in x0 */
+ register long x8 __asm__("x8") = 220; /* SYS_clone */
+ register long x0 __asm__("x0") = 0x4111; /* SIGCHLD | CLONE_VM | CLONE_VFORK */
+ register long x1 __asm__("x1") = 0; /* arg2 = 0 */
+
+ __asm__ volatile (
+ "svc 0\n\t"
+ ".hidden __syscall_ret\n\t"
+ "b __syscall_ret\n\t"
+ : "+r"(x0) /* x0 = in/out */
+ : "r"(x1), "r"(x8) /* inputs */
+ : "memory", "cc"
+ );
+ __builtin_unreachable();
+}
diff --git a/src/process/aarch64/vfork.s b/src/process/aarch64/vfork.s
deleted file mode 100644
index 429bec8c..00000000
--- a/src/process/aarch64/vfork.s
+++ /dev/null
@@ -1,9 +0,0 @@
-.global vfork
-.type vfork,%function
-vfork:
- mov x8, 220 // SYS_clone
- mov x0, 0x4111 // SIGCHLD | CLONE_VM | CLONE_VFORK
- mov x1, 0
- svc 0
- .hidden __syscall_ret
- b __syscall_ret
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 04/14] aarch64: rewrite clone routine in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (2 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 03/14] aarch64: rewrite vfork routine " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 05/14] aarch64: rewrite __syscall_cp_asm " Bill Roberts
` (10 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 clone routine from assembly into a C
implementation using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in clone.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/thread/aarch64/clone.c | 44 ++++++++++++++++++++++++++++++++++++++
src/thread/aarch64/clone.s | 31 ---------------------------
2 files changed, 44 insertions(+), 31 deletions(-)
create mode 100644 src/thread/aarch64/clone.c
delete mode 100644 src/thread/aarch64/clone.s
diff --git a/src/thread/aarch64/clone.c b/src/thread/aarch64/clone.c
new file mode 100644
index 00000000..f69d7b42
--- /dev/null
+++ b/src/thread/aarch64/clone.c
@@ -0,0 +1,44 @@
+// __clone(func, stack, flags, arg, ptid, tls, ctid)
+// x0, x1, w2, x3, x4, x5, x6
+
+// syscall(SYS_clone, flags, stack, ptid, tls, ctid)
+// x8, x0, x1, x2, x3, x4
+
+#include "syscall.h"
+
+__attribute__((visibility("hidden")))
+int __clone(int (*fn)(void *), void *stack, int flags, void *arg,
+ int *ptid, void *tls, int *ctid)
+{
+ __asm__ __volatile__(
+ // align stack and save func,arg
+ "and x1, x1, #-16\n\t"
+ "stp x0, x3, [x1, #-16]!\n\t"
+
+ // syscall: clone(flags, stack, ptid, tls, ctid)
+ "uxtw x0, w2\n\t" // x0 = (uint32_t)flags
+ "mov x2, x4\n\t" // ptid
+ "mov x3, x5\n\t" // tls
+ "mov x4, x6\n\t" // ctid
+ "mov x8, #220\n\t" // SYS_clone
+ "svc #0\n\t"
+
+ "cbz x0, 1f\n\t" // child gets 0
+ // parent: returns to caller with x0 = pid / -errno
+ "ret\n\t"
+
+ // child
+ "1:\n\t"
+ "mov x29, xzr\n\t"
+ "ldp x1, x0, [sp], #16\n\t" // x1=fn, x0=arg
+ "blr x1\n\t" // fn(arg) -> x0
+ "mov x8, #93\n\t" // SYS_exit
+ "svc #0\n\t"
+ :
+ :
+ : "x1","x2","x3","x4","x8",
+ "memory","cc"
+ );
+ __builtin_unreachable();
+}
+
diff --git a/src/thread/aarch64/clone.s b/src/thread/aarch64/clone.s
deleted file mode 100644
index aff8155b..00000000
--- a/src/thread/aarch64/clone.s
+++ /dev/null
@@ -1,31 +0,0 @@
-// __clone(func, stack, flags, arg, ptid, tls, ctid)
-// x0, x1, w2, x3, x4, x5, x6
-
-// syscall(SYS_clone, flags, stack, ptid, tls, ctid)
-// x8, x0, x1, x2, x3, x4
-
-.global __clone
-.hidden __clone
-.type __clone,%function
-__clone:
- // align stack and save func,arg
- and x1,x1,#-16
- stp x0,x3,[x1,#-16]!
-
- // syscall
- uxtw x0,w2
- mov x2,x4
- mov x3,x5
- mov x4,x6
- mov x8,#220 // SYS_clone
- svc #0
-
- cbz x0,1f
- // parent
- ret
- // child
-1: mov x29, 0
- ldp x1,x0,[sp],#16
- blr x1
- mov x8,#93 // SYS_exit
- svc #0
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 05/14] aarch64: rewrite __syscall_cp_asm in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (3 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 04/14] aarch64: rewrite clone " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 06/14] aarch64: rewrite __unmapself " Bill Roberts
` (9 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __syscall_cp_asm routine from assembly into a C
implementation using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in syscall_cp.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/thread/aarch64/syscall_cp.c | 61 +++++++++++++++++++++++++++++++++
src/thread/aarch64/syscall_cp.s | 32 -----------------
2 files changed, 61 insertions(+), 32 deletions(-)
create mode 100644 src/thread/aarch64/syscall_cp.c
delete mode 100644 src/thread/aarch64/syscall_cp.s
diff --git a/src/thread/aarch64/syscall_cp.c b/src/thread/aarch64/syscall_cp.c
new file mode 100644
index 00000000..d350be8f
--- /dev/null
+++ b/src/thread/aarch64/syscall_cp.c
@@ -0,0 +1,61 @@
+// __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z)
+// x0 x1 x2 x3 x4 x5 x6 x7
+
+// syscall(nr, u, v, w, x, y, z)
+// x8 x0 x1 x2 x3 x4 x5
+
+__attribute__((visibility("hidden")))
+long __syscall_cp_asm(volatile int *cancel, long nr,
+ long a0, long a1, long a2, long a3, long a4, long a5)
+{
+ long ret;
+
+ __asm__ volatile(
+ // BEGIN marker
+ ".globl __cp_begin\n\t"
+ ".hidden __cp_begin\n\t"
+ "__cp_begin:\n\t"
+
+ "ldr w0, [x0]\n\t" // if (*cancel) goto __cp_cancel
+ "cbnz w0, __cp_cancel\n\t"
+
+ // Pack syscall args:
+ // syscall(nr, u, v, w, x, y, z)
+ // x8 x0 x1 x2 x3 x4 x5
+ // currently:
+ // x1 = nr
+ // x2..x7 = a0..a5
+ "mov x8, x1\n\t"
+ "mov x0, x2\n\t"
+ "mov x1, x3\n\t"
+ "mov x2, x4\n\t"
+ "mov x3, x5\n\t"
+ "mov x4, x6\n\t"
+ "mov x5, x7\n\t"
+ "svc #0\n"
+
+ // END marker
+ ".globl __cp_end\n\t"
+ ".hidden __cp_end\n\t"
+ "__cp_end:\n\t"
+
+ // Save syscall result from x0 into a C variable.
+ "mov %0, x0\n\t"
+ : "=r"(ret)
+ : /* all inputs are already in x0..x7 per the ABI */
+ : "x0","x1","x2","x3","x4","x5","x6","x7",
+ "x8","memory","cc"
+ );
+
+ return ret;
+}
+
+
+__attribute__((visibility("hidden")))
+long __cancel(void);
+
+__attribute__((visibility("hidden")))
+long __cp_cancel(void)
+{
+ return __cancel();
+}
diff --git a/src/thread/aarch64/syscall_cp.s b/src/thread/aarch64/syscall_cp.s
deleted file mode 100644
index 41db68af..00000000
--- a/src/thread/aarch64/syscall_cp.s
+++ /dev/null
@@ -1,32 +0,0 @@
-// __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z)
-// x0 x1 x2 x3 x4 x5 x6 x7
-
-// syscall(nr, u, v, w, x, y, z)
-// x8 x0 x1 x2 x3 x4 x5
-
-.global __cp_begin
-.hidden __cp_begin
-.global __cp_end
-.hidden __cp_end
-.global __cp_cancel
-.hidden __cp_cancel
-.hidden __cancel
-.global __syscall_cp_asm
-.hidden __syscall_cp_asm
-.type __syscall_cp_asm,%function
-__syscall_cp_asm:
-__cp_begin:
- ldr w0,[x0]
- cbnz w0,__cp_cancel
- mov x8,x1
- mov x0,x2
- mov x1,x3
- mov x2,x4
- mov x3,x5
- mov x4,x6
- mov x5,x7
- svc 0
-__cp_end:
- ret
-__cp_cancel:
- b __cancel
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 06/14] aarch64: rewrite __unmapself in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (4 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 05/14] aarch64: rewrite __syscall_cp_asm " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 07/14] aarch64: rewrite tlsdesc routines " Bill Roberts
` (8 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __unmapself routine from assembly into a C
implementation using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in __unmapself.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/thread/aarch64/__unmapself.c | 16 ++++++++++++++++
src/thread/aarch64/__unmapself.s | 7 -------
2 files changed, 16 insertions(+), 7 deletions(-)
create mode 100644 src/thread/aarch64/__unmapself.c
delete mode 100644 src/thread/aarch64/__unmapself.s
diff --git a/src/thread/aarch64/__unmapself.c b/src/thread/aarch64/__unmapself.c
new file mode 100644
index 00000000..12639609
--- /dev/null
+++ b/src/thread/aarch64/__unmapself.c
@@ -0,0 +1,16 @@
+#include <stddef.h>
+
+__attribute__((visibility("hidden"), noreturn))
+void __unmapself(void *addr, size_t len)
+{
+ __asm__ volatile(
+ "mov x8, #215\n\t" // SYS_munmap
+ "svc #0\n\t"
+ "mov x8, #93\n\t" // SYS_exit
+ "svc #0\n\t"
+ :
+ :
+ : "x8"
+ );
+ __builtin_unreachable();
+}
diff --git a/src/thread/aarch64/__unmapself.s b/src/thread/aarch64/__unmapself.s
deleted file mode 100644
index 2c5d254f..00000000
--- a/src/thread/aarch64/__unmapself.s
+++ /dev/null
@@ -1,7 +0,0 @@
-.global __unmapself
-.type __unmapself,%function
-__unmapself:
- mov x8,#215 // SYS_munmap
- svc 0
- mov x8,#93 // SYS_exit
- svc 0
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 07/14] aarch64: rewrite tlsdesc routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (5 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 06/14] aarch64: rewrite __unmapself " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-11 12:10 ` Florian Weimer
2025-12-08 17:44 ` [musl] [RFC 08/14] aarch64: rewrite __restore_rt routines " Bill Roberts
` (7 subsequent siblings)
14 siblings, 1 reply; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __tlsdesc_dynamic and __tlsdesc_static
routines from assembly into C implementations using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in tlsdesc.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/ldso/aarch64/tlsdesc.c | 50 ++++++++++++++++++++++++++++++++++++++
src/ldso/aarch64/tlsdesc.s | 31 -----------------------
2 files changed, 50 insertions(+), 31 deletions(-)
create mode 100644 src/ldso/aarch64/tlsdesc.c
delete mode 100644 src/ldso/aarch64/tlsdesc.s
diff --git a/src/ldso/aarch64/tlsdesc.c b/src/ldso/aarch64/tlsdesc.c
new file mode 100644
index 00000000..224a9387
--- /dev/null
+++ b/src/ldso/aarch64/tlsdesc.c
@@ -0,0 +1,50 @@
+#include <stddef.h>
+#include <stdint.h>
+
+/* size_t __tlsdesc_static(size_t *a) { return a[1]; } */
+__attribute__((visibility("hidden")))
+size_t __tlsdesc_static(size_t *a)
+{
+ size_t result;
+
+ __asm__ __volatile__(
+ "ldr %0, [%1, #8]\n\t" /* result = *(a + 8) */
+ : "=r"(result)
+ : "r"(a)
+ : "memory"
+ );
+
+ return result;
+}
+
+/*
+ * size_t __tlsdesc_dynamic(size_t *a)
+ * {
+ * struct { size_t modidx, off; } *p = (void*)a[1];
+ * size_t tp = read_tpidr_el0();
+ * size_t *dtv = *(size_t **)(tp - 8);
+ * return dtv[p->modidx] + p->off - tp;
+ * }
+ */
+__attribute__((visibility("hidden")))
+size_t __tlsdesc_dynamic(size_t *a)
+{
+ size_t result;
+
+ __asm__ __volatile__(
+ "mrs x1, tpidr_el0\n\t" /* x1 := tp */
+ "ldr x0, [x0, #8]\n\t" /* x0 := p = (void*)a[1] */
+ "ldp x0, x2, [x0]\n\t" /* x0 := p->modidx, x2 := p->off */
+ "sub x2, x2, x1\n\t" /* x2 := p->off - tp */
+ "ldr x1, [x1, #-8]\n\t" /* x1 := dtv = *(tp - 8) */
+ "ldr x1, [x1, x0, lsl #3]\n\t" /* x1 := dtv[p->modidx] */
+ "add x0, x1, x2\n\t" /* x0 := dtv[p->modidx] + p->off - tp */
+
+ "mov %0, x0\n\t"
+ : "=r"(result)
+ :
+ : "x1","x2","memory","cc"
+ );
+
+ return result;
+}
diff --git a/src/ldso/aarch64/tlsdesc.s b/src/ldso/aarch64/tlsdesc.s
deleted file mode 100644
index c6c685b3..00000000
--- a/src/ldso/aarch64/tlsdesc.s
+++ /dev/null
@@ -1,31 +0,0 @@
-// size_t __tlsdesc_static(size_t *a)
-// {
-// return a[1];
-// }
-.global __tlsdesc_static
-.hidden __tlsdesc_static
-.type __tlsdesc_static,@function
-__tlsdesc_static:
- ldr x0,[x0,#8]
- ret
-
-// size_t __tlsdesc_dynamic(size_t *a)
-// {
-// struct {size_t modidx,off;} *p = (void*)a[1];
-// size_t *dtv = *(size_t**)(tp - 8);
-// return dtv[p->modidx] + p->off - tp;
-// }
-.global __tlsdesc_dynamic
-.hidden __tlsdesc_dynamic
-.type __tlsdesc_dynamic,@function
-__tlsdesc_dynamic:
- stp x1,x2,[sp,#-16]!
- mrs x1,tpidr_el0 // tp
- ldr x0,[x0,#8] // p
- ldp x0,x2,[x0] // p->modidx, p->off
- sub x2,x2,x1 // p->off - tp
- ldr x1,[x1,#-8] // dtv
- ldr x1,[x1,x0,lsl #3] // dtv[p->modidx]
- add x0,x1,x2 // dtv[p->modidx] + p->off - tp
- ldp x1,x2,[sp],#16
- ret
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 08/14] aarch64: rewrite __restore_rt routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (6 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 07/14] aarch64: rewrite tlsdesc routines " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 09/14] aarch64: rewrite longjmp " Bill Roberts
` (6 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __restore_rt routine from assembly into a C
implementation using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in restore.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/signal/aarch64/restore.c | 15 +++++++++++++++
src/signal/aarch64/restore.s | 10 ----------
2 files changed, 15 insertions(+), 10 deletions(-)
create mode 100644 src/signal/aarch64/restore.c
delete mode 100644 src/signal/aarch64/restore.s
diff --git a/src/signal/aarch64/restore.c b/src/signal/aarch64/restore.c
new file mode 100644
index 00000000..f20da148
--- /dev/null
+++ b/src/signal/aarch64/restore.c
@@ -0,0 +1,15 @@
+#include <stddef.h>
+
+/* rt_sigreturn on AArch64 = 139 */
+__attribute__((visibility("hidden"), noreturn))
+void __restore_rt(void)
+{
+ __asm__ __volatile__(
+ "mov x8, #139\n\t" /* SYS_rt_sigreturn */
+ "svc #0\n\t"
+ );
+ __builtin_unreachable();
+}
+
+__attribute__((visibility("hidden"), noreturn, alias("__restore_rt")))
+void __restore(void);
diff --git a/src/signal/aarch64/restore.s b/src/signal/aarch64/restore.s
deleted file mode 100644
index d4e5fcf1..00000000
--- a/src/signal/aarch64/restore.s
+++ /dev/null
@@ -1,10 +0,0 @@
-.global __restore
-.hidden __restore
-.type __restore,%function
-__restore:
-.global __restore_rt
-.hidden __restore_rt
-.type __restore_rt,%function
-__restore_rt:
- mov x8,#139 // SYS_rt_sigreturn
- svc 0
--
2.51.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [musl] [RFC 09/14] aarch64: rewrite longjmp routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (7 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 08/14] aarch64: rewrite __restore_rt routines " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 10/14] aarch64: rewrite setjmp " Bill Roberts
` (5 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 _longjmp and longjmp routines from assembly into
C implementations using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in longjmp.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/setjmp/aarch64/longjmp.c | 39 ++++++++++++++++++++++++++++++++++++
src/setjmp/aarch64/longjmp.s | 23 ---------------------
2 files changed, 39 insertions(+), 23 deletions(-)
create mode 100644 src/setjmp/aarch64/longjmp.c
delete mode 100644 src/setjmp/aarch64/longjmp.s
diff --git a/src/setjmp/aarch64/longjmp.c b/src/setjmp/aarch64/longjmp.c
new file mode 100644
index 00000000..1ac107e5
--- /dev/null
+++ b/src/setjmp/aarch64/longjmp.c
@@ -0,0 +1,39 @@
+#include <setjmp.h>
+
+_Noreturn void longjmp(jmp_buf env, int val)
+{
+ __asm__ __volatile__(
+ /* Restore integer callee-saved regs x19..x30 */
+ "ldp x19, x20, [x0, #0]\n\t"
+ "ldp x21, x22, [x0, #16]\n\t"
+ "ldp x23, x24, [x0, #32]\n\t"
+ "ldp x25, x26, [x0, #48]\n\t"
+ "ldp x27, x28, [x0, #64]\n\t"
+ "ldp x29, x30, [x0, #80]\n\t"
+
+ /* Restore SP from [x0 + 104] */
+ "ldr x2, [x0, #104]\n\t"
+ "mov sp, x2\n\t"
+
+ /* Restore FP callee-saved d8..d15 */
+ "ldp d8 , d9 , [x0, #112]\n\t"
+ "ldp d10, d11, [x0, #128]\n\t"
+ "ldp d12, d13, [x0, #144]\n\t"
+ "ldp d14, d15, [x0, #160]\n\t"
+
+ /* Compute return value in w0: (w1 != 0 ? w1 : 1) */
+ "cmp w1, #0\n\t"
+ "csinc w0, w1, wzr, ne\n\t"
+
+ /* Jump to saved LR */
+ "br x30\n\t"
+ :
+ :
+ : "memory", "cc" /* no GPR clobbers: the restored register state must survive into the target */
+ );
+
+ __builtin_unreachable();
+}
+
+/* Export _longjmp as an alias of longjmp (same TU). */
+__attribute__((alias("longjmp"))) void _longjmp(jmp_buf env, int val);
diff --git a/src/setjmp/aarch64/longjmp.s b/src/setjmp/aarch64/longjmp.s
deleted file mode 100644
index 0af9c50e..00000000
--- a/src/setjmp/aarch64/longjmp.s
+++ /dev/null
@@ -1,23 +0,0 @@
-.global _longjmp
-.global longjmp
-.type _longjmp,%function
-.type longjmp,%function
-_longjmp:
-longjmp:
- // IHI0055B_aapcs64.pdf 5.1.1, 5.1.2 callee saved registers
- ldp x19, x20, [x0,#0]
- ldp x21, x22, [x0,#16]
- ldp x23, x24, [x0,#32]
- ldp x25, x26, [x0,#48]
- ldp x27, x28, [x0,#64]
- ldp x29, x30, [x0,#80]
- ldr x2, [x0,#104]
- mov sp, x2
- ldp d8 , d9, [x0,#112]
- ldp d10, d11, [x0,#128]
- ldp d12, d13, [x0,#144]
- ldp d14, d15, [x0,#160]
-
- cmp w1, 0
- csinc w0, w1, wzr, ne
- br x30
--
2.51.0
* [musl] [RFC 10/14] aarch64: rewrite setjmp routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (8 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 09/14] aarch64: rewrite longjmp " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 11/14] aarch64: rewrite sigsetjmp " Bill Roberts
` (4 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __setjmp, _setjmp and setjmp routines from assembly into C
implementations using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in setjmp.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/setjmp/aarch64/setjmp.c | 34 ++++++++++++++++++++++++++++++++++
src/setjmp/aarch64/setjmp.s | 24 ------------------------
2 files changed, 34 insertions(+), 24 deletions(-)
create mode 100644 src/setjmp/aarch64/setjmp.c
delete mode 100644 src/setjmp/aarch64/setjmp.s
diff --git a/src/setjmp/aarch64/setjmp.c b/src/setjmp/aarch64/setjmp.c
new file mode 100644
index 00000000..2eb3dc68
--- /dev/null
+++ b/src/setjmp/aarch64/setjmp.c
@@ -0,0 +1,34 @@
+#include <setjmp.h>
+
+__attribute__((returns_twice))
+int setjmp(jmp_buf env)
+{
+ __asm__ __volatile__(
+ /* Save integer callee-saved registers x19..x30 */
+ "stp x19, x20, [x0, #0]\n\t"
+ "stp x21, x22, [x0, #16]\n\t"
+ "stp x23, x24, [x0, #32]\n\t"
+ "stp x25, x26, [x0, #48]\n\t"
+ "stp x27, x28, [x0, #64]\n\t"
+ "stp x29, x30, [x0, #80]\n\t"
+
+ /* Save SP at offset 104 */
+ "mov x2, sp\n\t"
+ "str x2, [x0, #104]\n\t"
+
+ /* Save FP/SIMD callee-saved d8..d15 */
+ "stp d8, d9, [x0, #112]\n\t"
+ "stp d10, d11, [x0, #128]\n\t"
+ "stp d12, d13, [x0, #144]\n\t"
+ "stp d14, d15, [x0, #160]\n\t"
+ :
+ :
+ : "x2", "memory"
+ );
+
+ return 0;
+}
+
+/* Make _setjmp and __setjmp the same symbol as setjmp in this TU. */
+__attribute__((alias("setjmp"))) int _setjmp(jmp_buf);
+__attribute__((alias("setjmp"))) int __setjmp(jmp_buf);
diff --git a/src/setjmp/aarch64/setjmp.s b/src/setjmp/aarch64/setjmp.s
deleted file mode 100644
index f49288aa..00000000
--- a/src/setjmp/aarch64/setjmp.s
+++ /dev/null
@@ -1,24 +0,0 @@
-.global __setjmp
-.global _setjmp
-.global setjmp
-.type __setjmp,@function
-.type _setjmp,@function
-.type setjmp,@function
-__setjmp:
-_setjmp:
-setjmp:
- // IHI0055B_aapcs64.pdf 5.1.1, 5.1.2 callee saved registers
- stp x19, x20, [x0,#0]
- stp x21, x22, [x0,#16]
- stp x23, x24, [x0,#32]
- stp x25, x26, [x0,#48]
- stp x27, x28, [x0,#64]
- stp x29, x30, [x0,#80]
- mov x2, sp
- str x2, [x0,#104]
- stp d8, d9, [x0,#112]
- stp d10, d11, [x0,#128]
- stp d12, d13, [x0,#144]
- stp d14, d15, [x0,#160]
- mov x0, #0
- ret
--
2.51.0
* [musl] [RFC 11/14] aarch64: rewrite sigsetjmp routines in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (9 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 10/14] aarch64: rewrite setjmp " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 12/14] aarch64: rewrite dlsym routine " Bill Roberts
` (3 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 __sigsetjmp and sigsetjmp routines from assembly into C
implementations using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in sigsetjmp.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/signal/aarch64/sigsetjmp.c | 43 ++++++++++++++++++++++++++++++++++
src/signal/aarch64/sigsetjmp.s | 21 -----------------
2 files changed, 43 insertions(+), 21 deletions(-)
create mode 100644 src/signal/aarch64/sigsetjmp.c
delete mode 100644 src/signal/aarch64/sigsetjmp.s
diff --git a/src/signal/aarch64/sigsetjmp.c b/src/signal/aarch64/sigsetjmp.c
new file mode 100644
index 00000000..3f266bff
--- /dev/null
+++ b/src/signal/aarch64/sigsetjmp.c
@@ -0,0 +1,43 @@
+#include <setjmp.h>
+
+#define OFF_LR 176 /* stored x30 (LR) */
+#define OFF_X19 (OFF_LR + 8 + 8) /* stored x19 */
+
+__attribute__((visibility("hidden")))
+int __sigsetjmp_tail(jmp_buf env, int val);
+
+__attribute__((returns_twice))
+int sigsetjmp(jmp_buf env, int savemask)
+{
+ /*
+ * We could do this in a mix of C and asm, but gcc keeps inserting a PAC
+ * instruction when saving the LR into jmp_buf. So just keep the whole thing
+ * in asm.
+ */
+ __asm__ __volatile__(
+ "cbz x1, setjmp\n\t"
+ /* Save x30 (LR) and x19 into the env at the exact offsets; set x19 = env. */
+ "str x30, [x0, #%c[off_lr]]\n\t"
+ "str x19, [x0, #%c[off_x19]]\n\t"
+ "mov x19, x0\n\t"
+ "bl setjmp\n\t"
+#if defined(__ARM_FEATURE_BTI_DEFAULT)
+ "hint 36\n\t" // bti j
+#endif
+ "mov w1, w0\n\t"
+ "mov x0,x19\n\t"
+ "ldr x30, [x0, #%c[off_lr]]\n\t"
+ "ldr x19, [x0, #%c[off_x19]]\n\t"
+ "b __sigsetjmp_tail\n\t"
+ :
+ : [off_lr]"i"(OFF_LR),
+ [off_x19]"i"(OFF_X19)
+ : "memory", "cc"
+ );
+ __builtin_unreachable();
+}
+
+/* Export __sigsetjmp as an alias to sigsetjmp (same TU requirement). */
+__attribute__((alias("sigsetjmp"), returns_twice))
+int __sigsetjmp(jmp_buf, int);
+
diff --git a/src/signal/aarch64/sigsetjmp.s b/src/signal/aarch64/sigsetjmp.s
deleted file mode 100644
index 75910c43..00000000
--- a/src/signal/aarch64/sigsetjmp.s
+++ /dev/null
@@ -1,21 +0,0 @@
-.global sigsetjmp
-.global __sigsetjmp
-.type sigsetjmp,%function
-.type __sigsetjmp,%function
-sigsetjmp:
-__sigsetjmp:
- cbz x1,setjmp
-
- str x30,[x0,#176]
- str x19,[x0,#176+8+8]
- mov x19,x0
-
- bl setjmp
-
- mov w1,w0
- mov x0,x19
- ldr x30,[x0,#176]
- ldr x19,[x0,#176+8+8]
-
-.hidden __sigsetjmp_tail
- b __sigsetjmp_tail
--
2.51.0
* [musl] [RFC 12/14] aarch64: rewrite dlsym routine in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (10 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 11/14] aarch64: rewrite sigsetjmp " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 13/14] aarch64: rewrite memcpy " Bill Roberts
` (2 subsequent siblings)
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 dlsym routine from assembly into a C implementation
using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in dlsym.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/ldso/aarch64/dlsym.c | 11 +++++++++++
src/ldso/aarch64/dlsym.s | 6 ------
2 files changed, 11 insertions(+), 6 deletions(-)
create mode 100644 src/ldso/aarch64/dlsym.c
delete mode 100644 src/ldso/aarch64/dlsym.s
diff --git a/src/ldso/aarch64/dlsym.c b/src/ldso/aarch64/dlsym.c
new file mode 100644
index 00000000..eeab1691
--- /dev/null
+++ b/src/ldso/aarch64/dlsym.c
@@ -0,0 +1,11 @@
+void *dlsym(void *handle, const char *name)
+{
+ __asm__ volatile(
+ "mov x2, x30\n\t" // x0 = handle, x1 = name, x2 = caller (LR)
+ "b __dlsym\n\t" // tail-call into __dlsym, which will RET to the original caller
+ :
+ :
+ : "x2"
+ );
+ __builtin_unreachable();
+}
diff --git a/src/ldso/aarch64/dlsym.s b/src/ldso/aarch64/dlsym.s
deleted file mode 100644
index abaae4d5..00000000
--- a/src/ldso/aarch64/dlsym.s
+++ /dev/null
@@ -1,6 +0,0 @@
-.global dlsym
-.hidden __dlsym
-.type dlsym,%function
-dlsym:
- mov x2,x30
- b __dlsym
--
2.51.0
* [musl] [RFC 13/14] aarch64: rewrite memcpy routine in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (11 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 12/14] aarch64: rewrite dlsym routine " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 14/14] aarch64: rewrite memset " Bill Roberts
2025-12-08 19:10 ` [musl] [RFC 00/14] aarch64: Convert to " Rich Felker
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 memcpy routine from assembly into a C implementation
using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in memcpy.S, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/string/aarch64/memcpy.S | 186 ------------------------------------
src/string/aarch64/memcpy.c | 168 ++++++++++++++++++++++++++++++++
2 files changed, 168 insertions(+), 186 deletions(-)
delete mode 100644 src/string/aarch64/memcpy.S
create mode 100644 src/string/aarch64/memcpy.c
diff --git a/src/string/aarch64/memcpy.S b/src/string/aarch64/memcpy.S
deleted file mode 100644
index 48bb8a8d..00000000
--- a/src/string/aarch64/memcpy.S
+++ /dev/null
@@ -1,186 +0,0 @@
-/*
- * memcpy - copy memory area
- *
- * Copyright (c) 2012-2020, Arm Limited.
- * SPDX-License-Identifier: MIT
- */
-
-/* Assumptions:
- *
- * ARMv8-a, AArch64, unaligned accesses.
- *
- */
-
-#define dstin x0
-#define src x1
-#define count x2
-#define dst x3
-#define srcend x4
-#define dstend x5
-#define A_l x6
-#define A_lw w6
-#define A_h x7
-#define B_l x8
-#define B_lw w8
-#define B_h x9
-#define C_l x10
-#define C_lw w10
-#define C_h x11
-#define D_l x12
-#define D_h x13
-#define E_l x14
-#define E_h x15
-#define F_l x16
-#define F_h x17
-#define G_l count
-#define G_h dst
-#define H_l src
-#define H_h srcend
-#define tmp1 x14
-
-/* This implementation of memcpy uses unaligned accesses and branchless
- sequences to keep the code small, simple and improve performance.
-
- Copies are split into 3 main cases: small copies of up to 32 bytes, medium
- copies of up to 128 bytes, and large copies. The overhead of the overlap
- check is negligible since it is only required for large copies.
-
- Large copies use a software pipelined loop processing 64 bytes per iteration.
- The destination pointer is 16-byte aligned to minimize unaligned accesses.
- The loop tail is handled by always copying 64 bytes from the end.
-*/
-
-.global memcpy
-.type memcpy,%function
-memcpy:
- add srcend, src, count
- add dstend, dstin, count
- cmp count, 128
- b.hi .Lcopy_long
- cmp count, 32
- b.hi .Lcopy32_128
-
- /* Small copies: 0..32 bytes. */
- cmp count, 16
- b.lo .Lcopy16
- ldp A_l, A_h, [src]
- ldp D_l, D_h, [srcend, -16]
- stp A_l, A_h, [dstin]
- stp D_l, D_h, [dstend, -16]
- ret
-
- /* Copy 8-15 bytes. */
-.Lcopy16:
- tbz count, 3, .Lcopy8
- ldr A_l, [src]
- ldr A_h, [srcend, -8]
- str A_l, [dstin]
- str A_h, [dstend, -8]
- ret
-
- .p2align 3
- /* Copy 4-7 bytes. */
-.Lcopy8:
- tbz count, 2, .Lcopy4
- ldr A_lw, [src]
- ldr B_lw, [srcend, -4]
- str A_lw, [dstin]
- str B_lw, [dstend, -4]
- ret
-
- /* Copy 0..3 bytes using a branchless sequence. */
-.Lcopy4:
- cbz count, .Lcopy0
- lsr tmp1, count, 1
- ldrb A_lw, [src]
- ldrb C_lw, [srcend, -1]
- ldrb B_lw, [src, tmp1]
- strb A_lw, [dstin]
- strb B_lw, [dstin, tmp1]
- strb C_lw, [dstend, -1]
-.Lcopy0:
- ret
-
- .p2align 4
- /* Medium copies: 33..128 bytes. */
-.Lcopy32_128:
- ldp A_l, A_h, [src]
- ldp B_l, B_h, [src, 16]
- ldp C_l, C_h, [srcend, -32]
- ldp D_l, D_h, [srcend, -16]
- cmp count, 64
- b.hi .Lcopy128
- stp A_l, A_h, [dstin]
- stp B_l, B_h, [dstin, 16]
- stp C_l, C_h, [dstend, -32]
- stp D_l, D_h, [dstend, -16]
- ret
-
- .p2align 4
- /* Copy 65..128 bytes. */
-.Lcopy128:
- ldp E_l, E_h, [src, 32]
- ldp F_l, F_h, [src, 48]
- cmp count, 96
- b.ls .Lcopy96
- ldp G_l, G_h, [srcend, -64]
- ldp H_l, H_h, [srcend, -48]
- stp G_l, G_h, [dstend, -64]
- stp H_l, H_h, [dstend, -48]
-.Lcopy96:
- stp A_l, A_h, [dstin]
- stp B_l, B_h, [dstin, 16]
- stp E_l, E_h, [dstin, 32]
- stp F_l, F_h, [dstin, 48]
- stp C_l, C_h, [dstend, -32]
- stp D_l, D_h, [dstend, -16]
- ret
-
- .p2align 4
- /* Copy more than 128 bytes. */
-.Lcopy_long:
-
- /* Copy 16 bytes and then align dst to 16-byte alignment. */
-
- ldp D_l, D_h, [src]
- and tmp1, dstin, 15
- bic dst, dstin, 15
- sub src, src, tmp1
- add count, count, tmp1 /* Count is now 16 too large. */
- ldp A_l, A_h, [src, 16]
- stp D_l, D_h, [dstin]
- ldp B_l, B_h, [src, 32]
- ldp C_l, C_h, [src, 48]
- ldp D_l, D_h, [src, 64]!
- subs count, count, 128 + 16 /* Test and readjust count. */
- b.ls .Lcopy64_from_end
-
-.Lloop64:
- stp A_l, A_h, [dst, 16]
- ldp A_l, A_h, [src, 16]
- stp B_l, B_h, [dst, 32]
- ldp B_l, B_h, [src, 32]
- stp C_l, C_h, [dst, 48]
- ldp C_l, C_h, [src, 48]
- stp D_l, D_h, [dst, 64]!
- ldp D_l, D_h, [src, 64]!
- subs count, count, 64
- b.hi .Lloop64
-
- /* Write the last iteration and copy 64 bytes from the end. */
-.Lcopy64_from_end:
- ldp E_l, E_h, [srcend, -64]
- stp A_l, A_h, [dst, 16]
- ldp A_l, A_h, [srcend, -48]
- stp B_l, B_h, [dst, 32]
- ldp B_l, B_h, [srcend, -32]
- stp C_l, C_h, [dst, 48]
- ldp C_l, C_h, [srcend, -16]
- stp D_l, D_h, [dst, 64]
- stp E_l, E_h, [dstend, -64]
- stp A_l, A_h, [dstend, -48]
- stp B_l, B_h, [dstend, -32]
- stp C_l, C_h, [dstend, -16]
- ret
-
-.size memcpy,.-memcpy
diff --git a/src/string/aarch64/memcpy.c b/src/string/aarch64/memcpy.c
new file mode 100644
index 00000000..7ecf4cde
--- /dev/null
+++ b/src/string/aarch64/memcpy.c
@@ -0,0 +1,168 @@
+/*
+ * memcpy - copy memory area
+ *
+ * Copyright (c) 2012-2020, Arm Limited.
+ * SPDX-License-Identifier: MIT
+ */
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ This implementation of memcpy uses unaligned accesses and branchless
+ sequences to keep the code small, simple and improve performance.
+
+ Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+ copies of up to 128 bytes, and large copies. The overhead of the overlap
+ check is negligible since it is only required for large copies.
+
+ Large copies use a software pipelined loop processing 64 bytes per iteration.
+ The destination pointer is 16-byte aligned to minimize unaligned accesses.
+ The loop tail is handled by always copying 64 bytes from the end.
+*/
+
+#include <stddef.h>
+
+void *memcpy(void *dstin, const void *src, size_t count)
+{
+ __asm__ volatile(
+ // === begin original memcpy.S body, using x0/x1/x2 directly ===
+ "add x4, x1, x2\n\t"
+ "add x5, x0, x2\n\t"
+ "cmp x2, 128\n\t"
+ "b.hi .Lcopy_long\n\t"
+ "cmp x2, 32\n\t"
+ "b.hi .Lcopy32_128\n\t"
+
+ /* Small copies: 0..32 bytes. */
+ "cmp x2, 16\n\t"
+ "b.lo .Lcopy16\n\t"
+ "ldp x6, x7, [x1]\n\t"
+ "ldp x12, x13, [x4, -16]\n\t"
+ "stp x6, x7, [x0]\n\t"
+ "stp x12, x13, [x5, -16]\n\t"
+ "ret\n"
+
+ /* Copy 8-15 bytes. */
+ ".Lcopy16:\n\t"
+ "tbz x2, 3, .Lcopy8\n\t"
+ "ldr x6, [x1]\n\t"
+ "ldr x7, [x4, -8]\n\t"
+ "str x6, [x0]\n\t"
+ "str x7, [x5, -8]\n\t"
+ "ret\n\t"
+
+ ".p2align 3\n"
+ /* Copy 4-7 bytes. */
+ ".Lcopy8:\n\t"
+ "tbz x2, 2, .Lcopy4\n\t"
+ "ldr w6, [x1]\n\t"
+ "ldr w8, [x4, -4]\n\t"
+ "str w6, [x0]\n\t"
+ "str w8, [x5, -4]\n\t"
+ "ret\n"
+
+ /* Copy 0..3 bytes using a branchless sequence. */
+ ".Lcopy4:\n\t"
+ "cbz x2, .Lcopy0\n\t"
+ "lsr x14, x2, 1\n\t"
+ "ldrb w6, [x1]\n\t"
+ "ldrb w10, [x4, -1]\n\t"
+ "ldrb w8, [x1, x14]\n\t"
+ "strb w6, [x0]\n\t"
+ "strb w8, [x0, x14]\n\t"
+ "strb w10, [x5, -1]\n"
+ ".Lcopy0:\n\t"
+ "ret\n\t"
+
+ ".p2align 4\n"
+ /* Medium copies: 33..128 bytes. */
+ ".Lcopy32_128:\n\t"
+ "ldp x6, x7, [x1]\n\t"
+ "ldp x8, x9, [x1, 16]\n\t"
+ "ldp x10, x11, [x4, -32]\n\t"
+ "ldp x12, x13, [x4, -16]\n\t"
+ "cmp x2, 64\n\t"
+ "b.hi .Lcopy128\n\t"
+ "stp x6, x7, [x0]\n\t"
+ "stp x8, x9, [x0, 16]\n\t"
+ "stp x10, x11, [x5, -32]\n\t"
+ "stp x12, x13, [x5, -16]\n\t"
+ "ret\n\t"
+
+ ".p2align 4\n"
+ /* Copy 65..128 bytes. */
+ ".Lcopy128:\n\t"
+ "ldp x14, x15, [x1, 32]\n\t"
+ "ldp x16, x17, [x1, 48]\n\t"
+ "cmp x2, 96\n\t"
+ "b.ls .Lcopy96\n\t"
+ "ldp x2, x3, [x4, -64]\n\t" // G_l, G_h
+ "ldp x1, x4, [x4, -48]\n\t" // H_l, H_h (uses x1/x4 temporarily)
+ "stp x2, x3, [x5, -64]\n\t"
+ "stp x1, x4, [x5, -48]\n"
+ ".Lcopy96:\n\t"
+ "stp x6, x7, [x0]\n\t"
+ "stp x8, x9, [x0, 16]\n\t"
+ "stp x14, x15, [x0, 32]\n\t"
+ "stp x16, x17, [x0, 48]\n\t"
+ "stp x10, x11, [x5, -32]\n\t"
+ "stp x12, x13, [x5, -16]\n\t"
+ "ret\n\t"
+
+ ".p2align 4\n"
+ /* Copy more than 128 bytes. */
+ ".Lcopy_long:\n"
+ /* Copy 16 bytes and then align dst to 16-byte alignment. */
+ "ldp x12, x13, [x1]\n\t"
+ "and x14, x0, 15\n\t"
+ "bic x3, x0, 15\n\t" // dst = x3
+ "sub x1, x1, x14\n\t" // src -= tmp1
+ "add x2, x2, x14\n\t" // count += tmp1 (now 16 too large)
+ "ldp x6, x7, [x1, 16]\n\t"
+ "stp x12, x13, [x0]\n\t"
+ "ldp x8, x9, [x1, 32]\n\t"
+ "ldp x10, x11, [x1, 48]\n\t"
+ "ldp x12, x13, [x1, 64]!\n\t"
+ "subs x2, x2, 128 + 16\n\t"
+ "b.ls .Lcopy64_from_end\n"
+
+ ".Lloop64:\n\t"
+ "stp x6, x7, [x3, 16]\n\t"
+ "ldp x6, x7, [x1, 16]\n\t"
+ "stp x8, x9, [x3, 32]\n\t"
+ "ldp x8, x9, [x1, 32]\n\t"
+ "stp x10, x11, [x3, 48]\n\t"
+ "ldp x10, x11, [x1, 48]\n\t"
+ "stp x12, x13, [x3, 64]!\n\t"
+ "ldp x12, x13, [x1, 64]!\n\t"
+ "subs x2, x2, 64\n\t"
+ "b.hi .Lloop64\n"
+
+ /* Write the last iteration and copy 64 bytes from the end. */
+ ".Lcopy64_from_end:\n\t"
+ "ldp x14, x15, [x4, -64]\n\t"
+ "stp x6, x7, [x3, 16]\n\t"
+ "ldp x6, x7, [x4, -48]\n\t"
+ "stp x8, x9, [x3, 32]\n\t"
+ "ldp x8, x9, [x4, -32]\n\t"
+ "stp x10, x11, [x3, 48]\n\t"
+ "ldp x10, x11, [x4, -16]\n\t"
+ "stp x12, x13, [x3, 64]\n\t"
+ "stp x14, x15, [x5, -64]\n\t"
+ "stp x6, x7, [x5, -48]\n\t"
+ "stp x8, x9, [x5, -32]\n\t"
+ "stp x10, x11, [x5, -16]\n\t"
+ "ret\n\t"
+ :
+ :
+ : "x3","x4","x5",
+ "x6","x7","x8","x9",
+ "x10","x11","x12","x13",
+ "x14","x15","x16","x17",
+ "cc","memory"
+ );
+
+ __builtin_unreachable(); // we always return via the 'ret' in asm
+}
+
--
2.51.0
* [musl] [RFC 14/14] aarch64: rewrite memset routine in C using inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (12 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 13/14] aarch64: rewrite memcpy " Bill Roberts
@ 2025-12-08 17:44 ` Bill Roberts
2025-12-08 19:10 ` [musl] [RFC 00/14] aarch64: Convert to " Rich Felker
14 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-08 17:44 UTC (permalink / raw)
To: musl; +Cc: Bill Roberts
Rewrite the AArch64 memset routine from assembly into a C implementation
using inline assembly.
This change eliminates the need for handwritten function prologues and
epilogues in memset.s, which simplifies maintenance and allows the compiler
to automatically insert architecture features such as BTI landing pads and
pointer authentication (PAC) sequences where applicable.
Moving to C also enables the compiler to manage register allocation,
stack usage, and ABI compliance automatically while keeping the low-level
behavior (bitmasks and register accesses) explicit and verifiable.
No functional changes intended.
Signed-off-by: Bill Roberts <bill.roberts@arm.com>
---
src/string/aarch64/memset.S | 115 ---------------------------------
src/string/aarch64/memset.c | 122 ++++++++++++++++++++++++++++++++++++
2 files changed, 122 insertions(+), 115 deletions(-)
delete mode 100644 src/string/aarch64/memset.S
create mode 100644 src/string/aarch64/memset.c
diff --git a/src/string/aarch64/memset.S b/src/string/aarch64/memset.S
deleted file mode 100644
index f0d29b7f..00000000
--- a/src/string/aarch64/memset.S
+++ /dev/null
@@ -1,115 +0,0 @@
-/*
- * memset - fill memory with a constant byte
- *
- * Copyright (c) 2012-2020, Arm Limited.
- * SPDX-License-Identifier: MIT
- */
-
-/* Assumptions:
- *
- * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
- *
- */
-
-#define dstin x0
-#define val x1
-#define valw w1
-#define count x2
-#define dst x3
-#define dstend x4
-#define zva_val x5
-
-.global memset
-.type memset,%function
-memset:
-
- dup v0.16B, valw
- add dstend, dstin, count
-
- cmp count, 96
- b.hi .Lset_long
- cmp count, 16
- b.hs .Lset_medium
- mov val, v0.D[0]
-
- /* Set 0..15 bytes. */
- tbz count, 3, 1f
- str val, [dstin]
- str val, [dstend, -8]
- ret
- nop
-1: tbz count, 2, 2f
- str valw, [dstin]
- str valw, [dstend, -4]
- ret
-2: cbz count, 3f
- strb valw, [dstin]
- tbz count, 1, 3f
- strh valw, [dstend, -2]
-3: ret
-
- /* Set 17..96 bytes. */
-.Lset_medium:
- str q0, [dstin]
- tbnz count, 6, .Lset96
- str q0, [dstend, -16]
- tbz count, 5, 1f
- str q0, [dstin, 16]
- str q0, [dstend, -32]
-1: ret
-
- .p2align 4
- /* Set 64..96 bytes. Write 64 bytes from the start and
- 32 bytes from the end. */
-.Lset96:
- str q0, [dstin, 16]
- stp q0, q0, [dstin, 32]
- stp q0, q0, [dstend, -32]
- ret
-
- .p2align 4
-.Lset_long:
- and valw, valw, 255
- bic dst, dstin, 15
- str q0, [dstin]
- cmp count, 160
- ccmp valw, 0, 0, hs
- b.ne .Lno_zva
-
-#ifndef SKIP_ZVA_CHECK
- mrs zva_val, dczid_el0
- and zva_val, zva_val, 31
- cmp zva_val, 4 /* ZVA size is 64 bytes. */
- b.ne .Lno_zva
-#endif
- str q0, [dst, 16]
- stp q0, q0, [dst, 32]
- bic dst, dst, 63
- sub count, dstend, dst /* Count is now 64 too large. */
- sub count, count, 128 /* Adjust count and bias for loop. */
-
- .p2align 4
-.Lzva_loop:
- add dst, dst, 64
- dc zva, dst
- subs count, count, 64
- b.hi .Lzva_loop
- stp q0, q0, [dstend, -64]
- stp q0, q0, [dstend, -32]
- ret
-
-.Lno_zva:
- sub count, dstend, dst /* Count is 16 too large. */
- sub dst, dst, 16 /* Dst is biased by -32. */
- sub count, count, 64 + 16 /* Adjust count and bias for loop. */
-.Lno_zva_loop:
- stp q0, q0, [dst, 32]
- stp q0, q0, [dst, 64]!
- subs count, count, 64
- b.hi .Lno_zva_loop
- stp q0, q0, [dstend, -64]
- stp q0, q0, [dstend, -32]
- ret
-
-.size memset,.-memset
-
diff --git a/src/string/aarch64/memset.c b/src/string/aarch64/memset.c
new file mode 100644
index 00000000..dfc820c6
--- /dev/null
+++ b/src/string/aarch64/memset.c
@@ -0,0 +1,122 @@
+/*
+ * memset - fill memory with a constant byte
+ *
+ * Copyright (c) 2012-2020, Arm Limited.
+ * SPDX-License-Identifier: MIT
+ */
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ *
+ */
+
+#include <stddef.h>
+
+void *memset(void *dstin, int c, size_t count)
+{
+ __asm__ volatile(
+ "dup v0.16B, w1\n\t"
+ "add x4, x0, x2\n\t" // dstend = dstin + count
+
+ "cmp x2, 96\n\t"
+ "b.hi .Lset_long\n\t"
+ "cmp x2, 16\n\t"
+ "b.hs .Lset_medium\n\t"
+ "mov x1, v0.D[0]\n\t"
+
+ /* Set 0..15 bytes. */
+ "tbz x2, 3, 1f\n\t"
+ "str x1, [x0]\n\t"
+ "str x1, [x4, -8]\n\t"
+ "ret\n\t"
+ "nop\n"
+
+ "1:\n\t"
+ "tbz x2, 2, 2f\n\t"
+ "str w1, [x0]\n\t"
+ "str w1, [x4, -4]\n\t"
+ "ret\n"
+
+ "2:\n\t"
+ "cbz x2, 3f\n\t"
+ "strb w1, [x0]\n\t"
+ "tbz x2, 1, 3f\n\t"
+ "strh w1, [x4, -2]\n"
+ "3:\n\t"
+ "ret\n"
+
+ /* Set 17..96 bytes. */
+ ".Lset_medium:\n\t"
+ "str q0, [x0]\n\t"
+ "tbnz x2, 6, .Lset96\n\t"
+ "str q0, [x4, -16]\n\t"
+ "tbz x2, 5, 1f\n\t"
+ "str q0, [x0, 16]\n\t"
+ "str q0, [x4, -32]\n"
+ "1:\n\t"
+ "ret\n\t"
+
+ ".p2align 4\n"
+ /* Set 64..96 bytes. Write 64 bytes from the start and
+ 32 bytes from the end. */
+ ".Lset96:\n\t"
+ "str q0, [x0, 16]\n\t"
+ "stp q0, q0, [x0, 32]\n\t"
+ "stp q0, q0, [x4, -32]\n\t"
+ "ret\n\t"
+
+ ".p2align 4\n"
+ ".Lset_long:\n\t"
+ "and w1, w1, 255\n\t"
+ "bic x3, x0, 15\n\t"
+ "str q0, [x0]\n\t"
+ "cmp x2, 160\n\t"
+ "ccmp w1, 0, 0, hs\n\t"
+ "b.ne .Lno_zva\n\t"
+
+#ifndef SKIP_ZVA_CHECK
+ "mrs x5, dczid_el0\n\t"
+ "and x5, x5, 31\n\t"
+ "cmp x5, 4\n\t" /* ZVA size is 64 bytes. */
+ "b.ne .Lno_zva\n\t"
+#endif
+ "str q0, [x3, 16]\n\t"
+ "stp q0, q0, [x3, 32]\n\t"
+ "bic x3, x3, 63\n\t"
+ "sub x2, x4, x3\n\t" /* Count is now 64 too large. */
+ "sub x2, x2, 128\n\t" /* Adjust count and bias for loop. */
+
+ ".p2align 4\n"
+ ".Lzva_loop:\n\t"
+ "add x3, x3, 64\n\t"
+ "dc zva, x3\n\t"
+ "subs x2, x2, 64\n\t"
+ "b.hi .Lzva_loop\n\t"
+ "stp q0, q0, [x4, -64]\n\t"
+ "stp q0, q0, [x4, -32]\n\t"
+ "ret\n"
+
+ ".Lno_zva:\n\t"
+ "sub x2, x4, x3\n\t" /* Count is 16 too large. */
+ "sub x3, x3, 16\n\t" /* Dst is biased by -32. */
+ "sub x2, x2, 64 + 16\n" /* Adjust count and bias for loop. */
+
+ ".Lno_zva_loop:\n\t"
+ "stp q0, q0, [x3, 32]\n\t"
+ "stp q0, q0, [x3, 64]!\n\t"
+ "subs x2, x2, 64\n\t"
+ "b.hi .Lno_zva_loop\n\t"
+ "stp q0, q0, [x4, -64]\n\t"
+ "stp q0, q0, [x4, -32]\n\t"
+ "ret\n\t"
+ :
+ :
+ : "x3", "x4", "x5", // dst, dstend, zva_val
+ "v0", // SIMD register used for pattern
+ "cc", "memory"
+ );
+
+ __builtin_unreachable(); // all returns are via the asm 'ret' paths
+}
+
--
2.51.0
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
` (13 preceding siblings ...)
2025-12-08 17:44 ` [musl] [RFC 14/14] aarch64: rewrite memset " Bill Roberts
@ 2025-12-08 19:10 ` Rich Felker
2025-12-14 2:22 ` Demi Marie Obenour
2025-12-18 8:34 ` Bill Roberts
14 siblings, 2 replies; 30+ messages in thread
From: Rich Felker @ 2025-12-08 19:10 UTC (permalink / raw)
To: Bill Roberts; +Cc: musl
On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
> Based on previous discussions on enabling PAC and BTI for Aarch64
> targets, rather than annotating the existing assembler, use inline
> assembly and mix of C. Now this has the benefits of:
> 1. Handling PAC, BTI and GCS.
> a. prologue and epilogue insertion as needed.
> b. Adding GNU notes as needed.
> 2. Adding in the CFI statements as needed.
>
> I'd love to get feedback, thanks!
>
> Bill Roberts (14):
> aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
> aarch64: rewrite fenv routines in C using inline asm
> aarch64: rewrite vfork routine in C using inline asm
> aarch64: rewrite clone routine in C using inline asm
> aarch64: rewrite __syscall_cp_asm in C using inline asm
> aarch64: rewrite __unmapself in C using inline asm
> aarch64: rewrite tlsdesc routines in C using inline asm
> aarch64: rewrite __restore_rt routines in C using inline asm
> aarch64: rewrite longjmp routines in C using inline asm
> aarch64: rewrite setjmp routines in C using inline asm
> aarch64: rewrite sigsetjmp routines in C using inline asm
> aarch64: rewrite dlsym routine in C using inline asm
> aarch64: rewrite memcpy routine in C using inline asm
> aarch64: rewrite memset routine in C using inline asm
Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
and dlsym are fundamentally wrong in that they have to be asm entry
points. Wrapping them in C breaks the state they need to receive.
Some others like __syscall_cp_asm are wrong by virtue of putting
symbol definitions inside inline asm, which may be emitted a different
number of times than it appears in the source. The labels in
__syscall_cp_asm must exist only once, so it really needs to be
external asm (for a slightly different reason than the entry point
needing to be asm).
The advice to move to inline asm was to do it where possible, i.e.
where it's gratuitous that we had an asm source file. But even where
this can be done, it should be done by actually writing the inline asm
with proper register constraints, not just copy-pasting the asm into C
files wrapped in __asm__. Some things, like __clone, even if they
could be done as C source files with asm, are not valid the way you've
just wrapped them because you're performing a return from within the
asm but don't have access to the return address or any way to undo
potential stack adjustments made in prologue before the __asm__. And
this would catastrophically break if LTO'd.
memcpy and memset are slated for "removal" at some point, replacing
the high level flow logic in arch-specific asm with shared high level
C and arch-provided asm only for the middle-section bulk copy/fill
operation in aligned and unaligned variants. I'm really not up for
reviewing and trusting in the correctness of large changes to any of
the existing arch-specific memcpy/memset asm or adding new ones for
other archs until then, because it's effort on something that's
intended to be removed. So these should just be kept as-is for now.
The approach to the fenv changes looks roughly right. This is also
something I'd like to do in an arch-generic way at some point, but
there's no good reason not to do it first on aarch64 like you've
proposed.
Removing crt[in].s is probably okay as well.
We generally prefer patch series as a single email with multiple MIME
attachments instead of git send-email threads, if that's easy for you
to do. It's not a big deal either way but it keeps folks' inbox volume
down and makes it easier to reply with review of the whole series
together.
Rich
* Re: [musl] [RFC 03/14] aarch64: rewrite vfork routine in C using inline asm
2025-12-08 17:44 ` [musl] [RFC 03/14] aarch64: rewrite vfork routine " Bill Roberts
@ 2025-12-11 12:09 ` Florian Weimer
2025-12-12 2:34 ` Rich Felker
2025-12-18 10:33 ` Bill Roberts
0 siblings, 2 replies; 30+ messages in thread
From: Florian Weimer @ 2025-12-11 12:09 UTC (permalink / raw)
To: Bill Roberts; +Cc: musl
* Bill Roberts:
> diff --git a/src/process/aarch64/vfork.c b/src/process/aarch64/vfork.c
> new file mode 100644
> index 00000000..87ec8ebf
> --- /dev/null
> +++ b/src/process/aarch64/vfork.c
> @@ -0,0 +1,21 @@
> +#include <sys/types.h>
> +
> +#include "syscall.h"
> +
> +pid_t vfork(void)
> +{
> + /* aarch64 Linux syscall: x8 = nr, x0..x5 = args, ret in x0 */
> + register long x8 __asm__("x8") = 220; /* SYS_clone */
> + register long x0 __asm__("x0") = 0x4111; /* SIGCHLD | CLONE_VM | CLONE_VFORK */
> + register long x1 __asm__("x1") = 0; /* arg2 = 0 */
> +
> + __asm__ volatile (
> + "svc 0\n\t"
> + ".hidden __syscall_ret\n\t"
> + "b __syscall_ret\n\t"
> + : "+r"(x0) /* x0 = in/out */
> + : "r"(x1), "r"(x8) /* inputs */
> + : "memory", "cc"
> + );
> + __builtin_unreachable();
> +}
This is incompatible with building with -fstack-protector-all, isn't it?
Thanks,
Florian
* Re: [musl] [RFC 07/14] aarch64: rewrite tlsdesc reoutines in C using inline asm
2025-12-08 17:44 ` [musl] [RFC 07/14] aarch64: rewrite tlsdesc reoutines " Bill Roberts
@ 2025-12-11 12:10 ` Florian Weimer
0 siblings, 0 replies; 30+ messages in thread
From: Florian Weimer @ 2025-12-11 12:10 UTC (permalink / raw)
To: Bill Roberts; +Cc: musl
* Bill Roberts:
> Rewrite the AArch64 __tlsdesc_dynamic and __tlsdesc_static
> routines from assembly into implementations using inline assembly.
>
> This change eliminates the need for handwritten function prologues and
> epilogues in tlsdesc.s, which simplifies maintenance and allows the compiler
> to automatically insert architecture features such as BTI landing pads and
> pointer authentication (PAC) sequences where applicable.
>
> Moving to C also enables the compiler to manage register allocation,
> stack usage, and ABI compliance automatically while keeping the low-level
> behavior (bitmasks and register accesses) explicit and verifiable.
>
> No functional changes intended.
>
> Signed-off-by: Bill Roberts <bill.roberts@arm.com>
> ---
> src/ldso/aarch64/tlsdesc.c | 50 ++++++++++++++++++++++++++++++++++++++
> src/ldso/aarch64/tlsdesc.s | 31 -----------------------
> 2 files changed, 50 insertions(+), 31 deletions(-)
> create mode 100644 src/ldso/aarch64/tlsdesc.c
> delete mode 100644 src/ldso/aarch64/tlsdesc.s
>
> diff --git a/src/ldso/aarch64/tlsdesc.c b/src/ldso/aarch64/tlsdesc.c
> new file mode 100644
> index 00000000..224a9387
> --- /dev/null
> +++ b/src/ldso/aarch64/tlsdesc.c
> @@ -0,0 +1,50 @@
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +/* size_t __tlsdesc_static(size_t *a) { return a[1]; } */
> +__attribute__((visibility("hidden")))
> +size_t __tlsdesc_static(size_t *a)
> +{
> + size_t result;
> +
> + __asm__ __volatile__(
> + "ldr %0, [%1, #8]\n\t" /* result = *(a + 8) */
> + : "=r"(result)
> + : "r"(a)
> + : "memory"
> + );
> +
> + return result;
> +}
I don't think these descriptor functions use the default AArch64 psABI,
so they can't be written in C (unless special compiler flags or function
attributes are used).
Thanks,
Florian
* Re: [musl] [RFC 03/14] aarch64: rewrite vfork routine in C using inline asm
2025-12-11 12:09 ` Florian Weimer
@ 2025-12-12 2:34 ` Rich Felker
2025-12-18 10:33 ` Bill Roberts
1 sibling, 0 replies; 30+ messages in thread
From: Rich Felker @ 2025-12-12 2:34 UTC (permalink / raw)
To: Florian Weimer; +Cc: Bill Roberts, musl
On Thu, Dec 11, 2025 at 01:09:50PM +0100, Florian Weimer wrote:
> [...]
> This is incompatible with building with -fstack-protector-all, isn't it?
As noted in my reply to the 00/14 cover letter for the patch series, a
number of these are not valid. This one is making a tail call from
inline asm with the stack in unknown state (invalid use of inline
asm), and more fundamentally, vfork cannot be implemented in C.
Rich
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-08 19:10 ` [musl] [RFC 00/14] aarch64: Convert to " Rich Felker
@ 2025-12-14 2:22 ` Demi Marie Obenour
2025-12-14 15:18 ` Rich Felker
2025-12-14 20:06 ` James Y Knight
2025-12-18 8:34 ` Bill Roberts
1 sibling, 2 replies; 30+ messages in thread
From: Demi Marie Obenour @ 2025-12-14 2:22 UTC (permalink / raw)
To: musl, Rich Felker, Bill Roberts
On 12/8/25 14:10, Rich Felker wrote:
> [...]
> memcpy and memset are slated for "removal" at some point, replacing
> the high level flow logic in arch-specific asm with shared high level
> C and arch-provided asm only for the middle-section bulk copy/fill
> operation in aligned and unaligned variants. I'm really not up for
> reviewing and trusting in the correctness of large changes to any of
> the existing arch-specific memcpy/memset asm or adding new ones for
> other archs until then, because it's effort on something that's
> intended to be removed. So these should just be kept as-is for now.
There is code in the wild that relies on memcpy not actually causing
data races, even though the C standard says otherwise. The problem is
that the standard provides literally no option for accessing memory
that may be concurrently modified by untrusted code, even though
doing so in assembly is perfectly okay.
To avoid data races, this code would need to be rewritten to use
assembly code for various architectures. I doubt this is a feasible
solution.
The proper fix is for the standard to include bytewise-atomic memcpy.
Until then, people will use hacks like this.
As a quality of implementation matter, I strongly recommend that all
accesses by memcpy() to user buffers happen in assembly code.
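In C11 terms, the primitive being asked for would look roughly like this (memcpy_bytewise_atomic is a hypothetical name, not an existing interface; relaxed per-byte atomic loads avoid the formal data race, at the cost of memcpy's bulk-copy performance):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch only: each source byte is read with a relaxed atomic load,
   so concurrent modification yields unspecified byte values rather
   than undefined behavior. */
static void *memcpy_bytewise_atomic(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const _Atomic unsigned char *s = src;
	for (size_t i = 0; i < n; i++)
		d[i] = atomic_load_explicit(&s[i], memory_order_relaxed);
	return dest;
}
```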
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 2:22 ` Demi Marie Obenour
@ 2025-12-14 15:18 ` Rich Felker
2025-12-14 19:11 ` Demi Marie Obenour
2025-12-14 20:06 ` James Y Knight
1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2025-12-14 15:18 UTC (permalink / raw)
To: Demi Marie Obenour; +Cc: musl, Bill Roberts
On Sat, Dec 13, 2025 at 09:22:43PM -0500, Demi Marie Obenour wrote:
> [...]
> There is code in the wild that relies on memcpy not actually causing
> data races, even though the C standard says otherwise. The problem is
> that the standard provides literally no option for accessing memory
> that may be concurrently modified by untrusted code, even though
> doing so in assembly is perfectly okay.
>
> To avoid data races, this code would need to be rewritten to use
> assembly code for various architectures. I doubt this is a feasible
> solution.
>
> The proper fix is for the standard to include bytewise-atomic memcpy.
> Until then, people will use hacks like this.
> As a quality of implementation matter, I strongly recommend that all
> accesses by memcpy() to user buffers happen in assembly code.
musl does not go out of its way to facilitate gross UB by
applications. If anything (if it's detectable and detected at low
cost, or if it's so high-risk that some cost is acceptable) we trap
and immediately crash when it's detected. We don't just make it
silently "do what the programmer wanted".
If a program is operating on memory that may change asynchronously out
from under it, it needs to be using the appropriate volatile or atomic
qualifications. memcpy very intentionally does not take a volatile
void * because it's not valid to pass pointers to such memory to
memcpy.
Rich
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 15:18 ` Rich Felker
@ 2025-12-14 19:11 ` Demi Marie Obenour
2025-12-14 20:08 ` Rich Felker
0 siblings, 1 reply; 30+ messages in thread
From: Demi Marie Obenour @ 2025-12-14 19:11 UTC (permalink / raw)
To: Rich Felker; +Cc: musl, Bill Roberts
On 12/14/25 10:18, Rich Felker wrote:
> On Sat, Dec 13, 2025 at 09:22:43PM -0500, Demi Marie Obenour wrote:
>> [...]
>>
>> There is code in the wild that relies on memcpy not actually causing
>> data races, even though the C standard says otherwise. The problem is
>> that the standard provides literally no option for accessing memory
>> that may be concurrently modified by untrusted code, even though
>> doing so in assembly is perfectly okay.
>>
>> To avoid data races, this code would need to be rewritten to use
>> assembly code for various architectures. I doubt this is a feasible
>> solution.
>>
>> The proper fix is for the standard to include bytewise-atomic memcpy.
>> Until then, people will use hacks like this.
>> As a quality of implementation matter, I strongly recommend that all
>> accesses by memcpy() to user buffers happen in assembly code.
>
> musl does not go out of its way to facilitate gross UB by
> applications. If anything (if it's detectable and detected at low
> cost, or if it's so high-risk that some cost is acceptable) we trap
> and immediately crash when it's detected. We don't just make it
> silently "do what the programmer wanted".
>
> If a program is operating on memory that may change asynchronously out
> from under it, it needs to be using the appropriate volatile or atomic
> qualifications. memcpy very intentionally does not take a volatile
> void * because it's not valid to pass pointers to such memory to
> memcpy.
>
> Rich
People do this because they have no other decent choices.
Doing anything else in C winds up with a 40x slowdown because the
compiler cannot optimize anything. Using hand-written assembly is a
maintenance nightmare. volatile is for MMIO, and atomics are for where
synchronization is needed. Both severely over-constrain the compiler.
Probably the best currently available solution is to come up with an
assembly code library that does the job. But that's going to be a
lot of work, and it won't fix the (likely *many*) programs with this
bug in the wild. Some of which are security-critical.
Is the risk of breaking existing code worth theoretical purity here?
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 2:22 ` Demi Marie Obenour
2025-12-14 15:18 ` Rich Felker
@ 2025-12-14 20:06 ` James Y Knight
2025-12-15 22:18 ` Demi Marie Obenour
1 sibling, 1 reply; 30+ messages in thread
From: James Y Knight @ 2025-12-14 20:06 UTC (permalink / raw)
To: musl; +Cc: Rich Felker, Bill Roberts
On Sat, Dec 13, 2025, 9:23 PM Demi Marie Obenour <demiobenour@gmail.com>
wrote:
> There is code in the wild that relies on memcpy not actually causing
> data races, even though the C standard says otherwise. The problem is
> that the standard provides literally no option for accessing memory
> that may be concurrently modified by untrusted code, even though
> doing so in assembly is perfectly okay.
>
> To avoid data races, this code would need to be rewritten to use
> assembly code for various architectures. I doubt this is a feasible
> solution.
>
> The proper fix is for the standard to include bytewise-atomic memcpy.
> Until then, people will use hacks like this.
> As a quality of implementation matter, I strongly recommend that all
> accesses by memcpy() to user buffers happen in assembly code.
>
I don't see how software could be usefully relying on memcpy being
implemented in asm for correctness/security, when common C compilers
recognize and optimize calls to memcpy under a data-race-free assumption
regardless of how the C library implements the out-of-line version of the
function.
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 19:11 ` Demi Marie Obenour
@ 2025-12-14 20:08 ` Rich Felker
2025-12-14 20:52 ` Demi Marie Obenour
0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2025-12-14 20:08 UTC (permalink / raw)
To: Demi Marie Obenour; +Cc: musl, Bill Roberts
On Sun, Dec 14, 2025 at 02:11:49PM -0500, Demi Marie Obenour wrote:
> On 12/14/25 10:18, Rich Felker wrote:
> > On Sat, Dec 13, 2025 at 09:22:43PM -0500, Demi Marie Obenour wrote:
> >> On 12/8/25 14:10, Rich Felker wrote:
> >>> On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
> >>>> Based on previous discussions on enabling PAC and BTI for Aarch64
> >>>> targets, rather than annotating the existing assembler, use inline
> >>>> assembly and mix of C. Now this has the benefits of:
> >>>> 1. Handling PAC, BTI and GCS.
> >>>> a. prologue and eplilog insertion as needed.
> >>>> b. Adding GNU notes as needed.
> >>>> 2. Adding in the CFI statements as needed.
> >>>>
> >>>> I'd love to get feedback, thanks!
> >>>>
> >>>> Bill Roberts (14):
> >>>> aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
> >>>> aarch64: rewrite fenv routines in C using inline asm
> >>>> aarch64: rewrite vfork routine in C using inline asm
> >>>> aarch64: rewrite clone routine in C using inline asm
> >>>> aarch64: rewrite __syscall_cp_asm in C using inline asm
> >>>> aarch64: rewrite __unmapself in C using inline asm
> >>>> aarch64: rewrite tlsdesc reoutines in C using inline asm
> >>>> aarch64: rewrite __restore_rt routines in C using inline asm
> >>>> aarch64: rewrite longjmp routines in C using inline asm
> >>>> aarch64: rewrite setjmp routines in C using inline asm
> >>>> aarch64: rewrite sigsetjmp routines in C using inline asm
> >>>> aarch64: rewrite dlsym routine in C using inline asm
> >>>> aarch64: rewrite memcpy routine in C using inline asm
> >>>> aarch64: rewrite memset routine in C using inline asm
> >>>
> >>> Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
> >>> and dlsym are fundamentally wrong in that they have to be asm entry
> >>> points. Wrapping them in C breaks the state they need to receive.
> >>>
> >>> Some others like __syscall_cp_asm are wrong by virtue of putting
> >>> symbol definitions inside inline asm, which may be emitted a different
> >>> number of times than it appears in the source. The labels in
> >>> __syscall_cp_asm must exist only once, so it really needs to be
> >>> external asm (for a slightly different reason than the entry point
> >>> needing to be asm).
> >>>
> >>> The advice to move to inline asm was to do it where possible, i.e.
> >>> where it's gratuitous that we had an asm source file. But even where
> >>> this can be done, it should be done by actually writing the inline asm
> >>> with proper register constraints, not just copy-pasting the asm into C
> >>> files wrapped in __asm__. Some things, like __clone, even if they
> >>> could be done as C source files with asm, are not valid the way you've
> >>> just wrapped them because you're performing a return from within the
> >>> asm but don't have access to the return address or any way to undo
> >>> potential stack adjustments made in prologue before the __asm__. And
> >>> this would catastrophically break if LTO'd.
> >>>
> >>> memcpy and memset are slated for "removal" at some point, replacing
> >>> the high level flow logic in arch-specific asm with shared high level
> >>> C and arch-provided asm only for the middle-section bulk copy/fill
> >>> operation in aligned and unaligned variants. I'm really not up for
> >>> reviewing and trusting in the correctness of large changes to any of
> >>> the existing arch-specific memcpy/memset asm or adding new ones for
> >>> other archs until then, because it's effort on something that's
> >>> intended to be removed. So these should just be kept as-is for now.
> >>
> >> There is code in the wild that relies on memcpy not actually causing
> >> data races, even though the C standard says otherwise. The problem is
> >> that the standard provides literally no option for accessing memory
> >> that may be concurrently modified by untrusted code, even doing so
> >> in assembly is perfectly okay.
> >>
> >> To avoid data races, this code would need to be rewritten to use
> >> assembly code for various architectures. I doubt this is a feasible
> >> solution.
> >>
> >> The proper fix is for the standard to include bytewise-atomic memcpy.
> >> Until then, people will use hacks like this.
> >> As a quality of implementation matter, I strongly recommend that all
> >> accesses by memcpy() to user buffers happen in assembly code.
> >
> > musl does not go out of its way to facilitate gross UB by
> > applications. If anything (if it's detectable and detected at low
> > cost, or if it's so high-risk that some cost is acceptable) we trap
> > and immediately crash when it's detected. We don't just make it
> > silently "do what the programmer wanted".
> >
> > If a program is operating on memory that may change asynchronously out
> > from under it, it needs to be using the appropriate volatile or atomic
> > qualifications. memcpy very intentionally does not take a volatile
> > void * because it's not valid to pass pointers to such memory to
> > memcpy.
>
> People do this because they have no other decent choices.
>
> Doing anything else in C winds up with a 40x slowdown because the
> compiler cannot optimize anything.
Not 40x, maybe 4x at worst. A naive C memcpy is not particularly bad.
One with volatile and byte accesses might be a bit worse, but you
don't do a generic volatile-memcpy with byte accesses for this. If
your data is 32-bit pixel values or something, you write the code to
work with that. And if you want to lightly vectorize that in a fairly
portable way, you can conditionally use __attribute__((__may_alias__))
with a larger type to perform the volatile copy in larger units.
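Concretely, that might look like the following sketch (pixel32 and copy_pixels are illustrative names, not existing interfaces):

```c
#include <stdint.h>
#include <stddef.h>

/* A 32-bit lane type that may alias anything, so a volatile
   framebuffer-like region can be copied in pixel-sized units
   without violating strict aliasing. */
typedef uint32_t __attribute__((__may_alias__)) pixel32;

static void copy_pixels(uint32_t *dst, const volatile void *src, size_t n)
{
	const volatile pixel32 *s = src;
	for (size_t i = 0; i < n; i++)
		dst[i] = s[i]; /* exactly one 32-bit load per element */
}
```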
> Using hand-written assembly is a
> maintenance nightmare. volatile is for MMIO, and atomics are for where
> synchronization is needed. Both severely over-constrain the compiler.
volatile can represent any sort of memory where the contents can
change asynchronously out from under you, where the number and order
and size of accesses in the generated machine code needs to match the
number and order and size on the abstract machine.
> Probably the best currently available solution is to come up with an
> assembly code library that does the job. But that's going to be a
This is probably true. And I think some intensive video/graphics
processing code does roughly that already, just for mainstream archs.
> lot of work, and it won't fix the (likely *many*) programs with this
> bug in the wild. Some of which are security-critical.
I'm really skeptical of "security-critical". If there are programs
processing memory that can be modified by something else out from
under them in a security-relevant context(*), those programs should
absolutely not be used. Avoiding formal data races is the least of
their problem. Intentional malicious modification in particular orders
to subvert program logic is a much bigger issue.
(*) as in not just image or sound buffer contents where the worst-case
failure in practice is visual glitches or audio pops
> Is the risk of breaking existing code worth theoretical purity here?
It has always been broken. We can't unilaterally change that. No
existing implementations make any promises of working with volatile
source or destination buffers, and even in musl, most archs use the C
memcpy code. There are only a couple with asm, and the asm is worse
than the naive C in some important cases (like small n). The
motivation for changing it is not "breaking existing code" but fixing
gratuitously bad performance and making it practical to support
optimized bulk copies on all archs. I am not comfortable trusting asm
loop logic for N archs and being responsible for the consequences if
any of them have bugs, nor committing to the review work that would
entail, nor outsourcing it. But small linear per-arch inline asm is
tractable.
Rich
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 20:08 ` Rich Felker
@ 2025-12-14 20:52 ` Demi Marie Obenour
0 siblings, 0 replies; 30+ messages in thread
From: Demi Marie Obenour @ 2025-12-14 20:52 UTC (permalink / raw)
To: Rich Felker; +Cc: musl, Bill Roberts
On 12/14/25 15:08, Rich Felker wrote:
> On Sun, Dec 14, 2025 at 02:11:49PM -0500, Demi Marie Obenour wrote:
>> On 12/14/25 10:18, Rich Felker wrote:
>>> On Sat, Dec 13, 2025 at 09:22:43PM -0500, Demi Marie Obenour wrote:
>>>> On 12/8/25 14:10, Rich Felker wrote:
>>>>> On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
>>>>>> Based on previous discussions on enabling PAC and BTI for Aarch64
>>>>>> targets, rather than annotating the existing assembler, use inline
>>>>>> assembly and mix of C. Now this has the benefits of:
>>>>>> 1. Handling PAC, BTI and GCS.
>>>>>> a. prologue and eplilog insertion as needed.
>>>>>> b. Adding GNU notes as needed.
>>>>>> 2. Adding in the CFI statements as needed.
>>>>>>
>>>>>> I'd love to get feedback, thanks!
>>>>>>
>>>>>> Bill Roberts (14):
>>>>>> aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
>>>>>> aarch64: rewrite fenv routines in C using inline asm
>>>>>> aarch64: rewrite vfork routine in C using inline asm
>>>>>> aarch64: rewrite clone routine in C using inline asm
>>>>>> aarch64: rewrite __syscall_cp_asm in C using inline asm
>>>>>> aarch64: rewrite __unmapself in C using inline asm
>>>>>> aarch64: rewrite tlsdesc routines in C using inline asm
>>>>>> aarch64: rewrite __restore_rt routines in C using inline asm
>>>>>> aarch64: rewrite longjmp routines in C using inline asm
>>>>>> aarch64: rewrite setjmp routines in C using inline asm
>>>>>> aarch64: rewrite sigsetjmp routines in C using inline asm
>>>>>> aarch64: rewrite dlsym routine in C using inline asm
>>>>>> aarch64: rewrite memcpy routine in C using inline asm
>>>>>> aarch64: rewrite memset routine in C using inline asm
>>>>>
>>>>> Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
>>>>> and dlsym are fundamentally wrong in that they have to be asm entry
>>>>> points. Wrapping them in C breaks the state they need to receive.
>>>>>
>>>>> Some others like __syscall_cp_asm are wrong by virtue of putting
>>>>> symbol definitions inside inline asm, which may be emitted a different
>>>>> number of times than it appears in the source. The labels in
>>>>> __syscall_cp_asm must exist only once, so it really needs to be
>>>>> external asm (for a slightly different reason than the entry point
>>>>> needing to be asm).
>>>>>
>>>>> The advice to move to inline asm was to do it where possible, i.e.
>>>>> where it's gratuitous that we had an asm source file. But even where
>>>>> this can be done, it should be done by actually writing the inline asm
>>>>> with proper register constraints, not just copy-pasting the asm into C
>>>>> files wrapped in __asm__. Some things, like __clone, even if they
>>>>> could be done as C source files with asm, are not valid the way you've
>>>>> just wrapped them because you're performing a return from within the
>>>>> asm but don't have access to the return address or any way to undo
>>>>> potential stack adjustments made in prologue before the __asm__. And
>>>>> this would catastrophically break if LTO'd.
>>>>>
>>>>> memcpy and memset are slated for "removal" at some point, replacing
>>>>> the high level flow logic in arch-specific asm with shared high level
>>>>> C and arch-provided asm only for the middle-section bulk copy/fill
>>>>> operation in aligned and unaligned variants. I'm really not up for
>>>>> reviewing and trusting in the correctness of large changes to any of
>>>>> the existing arch-specific memcpy/memset asm or adding new ones for
>>>>> other archs until then, because it's effort on something that's
>>>>> intended to be removed. So these should just be kept as-is for now.
>>>>
>>>> There is code in the wild that relies on memcpy not actually causing
>>>> data races, even though the C standard says otherwise. The problem is
>>>> that the standard provides literally no option for accessing memory
>>>> that may be concurrently modified by untrusted code, even doing so
>>>> in assembly is perfectly okay.
>>>>
>>>> To avoid data races, this code would need to be rewritten to use
>>>> assembly code for various architectures. I doubt this is a feasible
>>>> solution.
>>>>
>>>> The proper fix is for the standard to include bytewise-atomic memcpy.
>>>> Until then, people will use hacks like this.
>>>> As a quality of implementation matter, I strongly recommend that all
>>>> accesses by memcpy() to user buffers happen in assembly code.
>>>
>>> musl does not go out of its way to facilitate gross UB by
>>> applications. If anything (if it's detectable and detected at low
>>> cost, or if it's so high-risk that some cost is acceptable) we trap
>>> and immediately crash when it's detected. We don't just make it
>>> silently "do what the programmer wanted".
>>>
>>> If a program is operating on memory that may change asynchronously out
>>> from under it, it needs to be using the appropriate volatile or atomic
>>> qualifications. memcpy very intentionally does not take a volatile
>>> void * because it's not valid to pass pointers to such memory to
>>> memcpy.
>>
>> People do this because they have no other decent choices.
>>
>> Doing anything else in C winds up with a 40x slowdown because the
>> compiler cannot optimize anything.
>
> Not 40x, maybe 4x at worst. A naive C memcpy is not particularly bad.
> One with volatile and byte accesses might be a bit worse, but you
> don't do a generic volatile-memcpy with byte accesses for this. If
> your data is 32-bit pixel values or something, you write the code to
> work with that. And if you want to lightly vectorize that in a fairly
> portable way, you can conditionally use __attribute__((__may_alias__))
> with a larger type to perform the volatile copy in larger units.
Thanks for the suggestion! Presumably a decent compiler should be
able to turn volatile __uint128 accesses into 128-bit SIMD operations
where supported. So this will not be as bad as I thought.
On x86, rep movsb is another option, but it has a horrible corner
case on Zen 3 where it slows to a crawl due to what appears to be a
microcode bug.
Of course, too much complexity can mean I-cache misses and that's
also bad.
>> Using hand-written assembly is a
>> maintenance nightmare. volatile is for MMIO, and atomics are for where
>> synchronization is needed. Both severely over-constrain the compiler.
>
> volatile can represent any sort of memory where the contents can
> change asynchronously out from under you, where the number and order
> and size of accesses in the generated machine code needs to match the
> number and order and size on the abstract machine.
That's what I meant by "over-constrain". In this case, the order
and size *do not* need to match those of the abstract machine.
It's even okay for the source memory to be read more than once,
or for the destination to be written to more than once. All that
matters is that concurrent modification does not cause UB.
In these applications, concurrent modification *always* indicates a
bug or malicious activity, almost always the former. The problem is
that the other side cannot be trusted to not do that.
>> Probably the best currently available solution is to come up with an
>> assembly code library that does the job. But that's going to be a
>
> This is probably true. And I think some intensive video/graphics
> processing code does roughly that already, just for mainstream archs.
>
>> lot of work, and it won't fix the (likely *many*) programs with this
>> bug in the wild. Some of which are security-critical.
>
> I'm really skeptical of "security-critical". If there are programs
> processing memory that can be modified by something else out from
> under them in a security-relevant context(*), those programs should
> absolutely not be used. Avoiding formal data races is the least of
> their problem. Intentional malicious modification in particular orders
> to subvert program logic is a much bigger issue.
>
> (*) as in not just image or sound buffer contents where the worst-case
> failure in practice is visual glitches or audio pops
My motivating example is libxenvchan, which doesn't expose the shared
buffer to applications. All libxenvchan API functions copy the data
from the shared memory to application-provided memory. In fact, the
motivation for libxenvchan is to make time-of-check to time-of-use
vulnerabilities much harder to write. The challenge is to perform
the copy in an efficient way without undefined behavior.
>> Is the risk of breaking existing code worth theoretical purity here?
>
> It has always been broken. We can't unilaterally change that. No
> existing implementations make any promises of working with volatile
> source or destination buffers, and even in musl, most archs use the C
> memcpy code. There are only a couple with asm, and the asm is worse
> than the naive C in some important cases (like small n). The
> motivation for changing it is not "breaking existing code" but fixing
> gratuitously bad performance and making it practical to support
> optimized bulk copies on all archs. I am not comfortable trusting asm
> loop logic for N archs and being responsible for the consequences if
> any of them have bugs, nor committing to the review work that would
> entail, nor outsourcing it. But small linear per-arch inline asm is
> tractable.
>
> Rich
That makes sense. Thanks for the explanation! I sent an email to
xen-devel recommending that they fix the bug. Xen already uses a
bunch of assembly code (it's a hypervisor, after all!), but even the
C suggestion you mentioned above should do the job.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-14 20:06 ` James Y Knight
@ 2025-12-15 22:18 ` Demi Marie Obenour
0 siblings, 0 replies; 30+ messages in thread
From: Demi Marie Obenour @ 2025-12-15 22:18 UTC (permalink / raw)
To: musl, James Y Knight; +Cc: Rich Felker, Bill Roberts
On 12/14/25 15:06, James Y Knight wrote:
> On Sat, Dec 13, 2025, 9:23 PM Demi Marie Obenour <demiobenour@gmail.com>
> wrote:
>
>> There is code in the wild that relies on memcpy not actually causing
>> data races, even though the C standard says otherwise. The problem is
>> that the standard provides literally no option for accessing memory
>> that may be concurrently modified by untrusted code, even doing so
>> in assembly is perfectly okay.
>>
>> To avoid data races, this code would need to be rewritten to use
>> assembly code for various architectures. I doubt this is a feasible
>> solution.
>>
>> The proper fix is for the standard to include bytewise-atomic memcpy.
>> Until then, people will use hacks like this.
>> As a quality of implementation matter, I strongly recommend that all
>> accesses by memcpy() to user buffers happen in assembly code.
>>
>
> I don't see how software could be usefully relying on memcpy being
> implemented in asm for correctness/security, when common C compilers
> recognize and optimize calls to memcpy under a data-race-free assumption
> regardless of how the C library implements the out-of-line version of the
> function.
Makes sense. This is a case where the only way to get optimal
performance is to hand-write assembly code. There is no way to
avoid undefined behavior without significantly over-constraining
the compiler.
What I really want is to be able to disable the data-race-free
assumption with a compiler switch. It makes anything dealing with
shared memory a nightmare, and (to the best of my understanding)
the performance gain is limited at best.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-08 19:10 ` [musl] [RFC 00/14] aarch64: Convert to " Rich Felker
2025-12-14 2:22 ` Demi Marie Obenour
@ 2025-12-18 8:34 ` Bill Roberts
2026-01-07 15:06 ` Rich Felker
1 sibling, 1 reply; 30+ messages in thread
From: Bill Roberts @ 2025-12-18 8:34 UTC (permalink / raw)
To: musl
Sorry for the delay folks, holidays and such. Happy 2026!
On 12/8/25 1:10 PM, Rich Felker wrote:
> On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
>> Based on previous discussions on enabling PAC and BTI for Aarch64
>> targets, rather than annotating the existing assembler, use inline
>> assembly and a mix of C. This has the benefits of:
>> 1. Handling PAC, BTI and GCS.
>> a. prologue and epilogue insertion as needed.
>> b. Adding GNU notes as needed.
>> 2. Adding in the CFI statements as needed.
>>
>> I'd love to get feedback, thanks!
>>
>> Bill Roberts (14):
>> aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
>> aarch64: rewrite fenv routines in C using inline asm
>> aarch64: rewrite vfork routine in C using inline asm
>> aarch64: rewrite clone routine in C using inline asm
>> aarch64: rewrite __syscall_cp_asm in C using inline asm
>> aarch64: rewrite __unmapself in C using inline asm
>> aarch64: rewrite tlsdesc routines in C using inline asm
>> aarch64: rewrite __restore_rt routines in C using inline asm
>> aarch64: rewrite longjmp routines in C using inline asm
>> aarch64: rewrite setjmp routines in C using inline asm
>> aarch64: rewrite sigsetjmp routines in C using inline asm
>> aarch64: rewrite dlsym routine in C using inline asm
>> aarch64: rewrite memcpy routine in C using inline asm
>> aarch64: rewrite memset routine in C using inline asm
>
> Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
> and dlsym are fundamentally wrong in that they have to be asm entry
> points. Wrapping them in C breaks the state they need to receive.
I went through the generated code and ran tests against all of this, and
it didn't break. Is there some specific case or compiler option where
this explodes? What state exactly gets trashed?
>
> Some others like __syscall_cp_asm are wrong by virtue of putting
> symbol definitions inside inline asm, which may be emitted a different
> number of times than it appears in the source. The labels in
> __syscall_cp_asm must exist only once, so it really needs to be
> external asm (for a slightly different reason than the entry point
> needing to be asm).
Ah yes, inlining could duplicate the label.
>
> The advice to move to inline asm was to do it where possible, i.e.
> where it's gratuitous that we had an asm source file. But even where
> this can be done, it should be done by actually writing the inline asm
> with proper register constraints, not just copy-pasting the asm into C
> files wrapped in __asm__. Some things, like __clone, even if they
> could be done as C source files with asm, are not valid the way you've
> just wrapped them because you're performing a return from within the
> asm but don't have access to the return address or any way to undo
> potential stack adjustments made in prologue before the __asm__. And
> this would catastrophically break if LTO'd.
>
> memcpy and memset are slated for "removal" at some point, replacing
> the high level flow logic in arch-specific asm with shared high level
> C and arch-provided asm only for the middle-section bulk copy/fill
> operation in aligned and unaligned variants. I'm really not up for
> reviewing and trusting in the correctness of large changes to any of
> the existing arch-specific memcpy/memset asm or adding new ones for
> other archs until then, because it's effort on something that's
> intended to be removed. So these should just be kept as-is for now.
Ok, so then where we keep asm the same, you want to just always include
a BTI or PAC instruction as needed?
>
> The approach to the fenv changes looks roughly right. This is also
> something I'd like to do in an arch-generic way at some point, but
> there's no good reason not to do it first on aarch64 like you've
> proposed.
So perhaps I can send this as a separate patch not in an RFC state?
>
> Removing crt[in].s is probably okay as well.
Same for this patch, send as a non-rfc ready to go?
>
> We generally prefer patch series as a single email with multiple MIME
> attachments instead of git send-email threads, if that's easy for you
> to do. It's not a big deal either way but it keeps folks' inbox volume
> down and makes it easier to reply with review of the whole series
> together.
I need to figure that out exactly, my git send-email foo is not strong.
Is this literally just firing up a mail client and sending an email with
attachments or is this using git send-email with -m?
>
> Rich
Thanks Rich, so reading your comments, I think I'll need to re-architect
some of the code base, not a problem. Something similar in concept to
the fenv patch style. I can use plain asm entry points that just tail
call into C and then they would just get marked with a single BTI C
instruction which would just NOP on non-supported platforms.
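A minimal sketch of that plan (aarch64 asm; the `__vfork_c` name and layout are hypothetical, not from the patch series): the entry point stays in asm so the incoming register state is untouched, carries the BTI landing pad, and tail-calls a C body:

```asm
/* Hypothetical sketch: bare asm entry point that receives the unmodified
 * register state, marked with "bti c" (encoded as a hint, so it NOPs on
 * cores without BTI), tail-calling a C implementation. */
.global vfork
.hidden __vfork_c        /* hypothetical C-side implementation */
.type vfork, %function
vfork:
	bti c
	b __vfork_c
```

This keeps the ABI-sensitive first instruction sequence under manual control while letting the compiler handle the rest.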
* Re: [musl] [RFC 03/14] aarch64: rewrite vfork routine in C using inline asm
2025-12-11 12:09 ` Florian Weimer
2025-12-12 2:34 ` Rich Felker
@ 2025-12-18 10:33 ` Bill Roberts
1 sibling, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2025-12-18 10:33 UTC (permalink / raw)
To: musl, Florian Weimer, Bill Roberts
On 12/11/25 6:09 AM, Florian Weimer wrote:
> * Bill Roberts:
>
>> diff --git a/src/process/aarch64/vfork.c b/src/process/aarch64/vfork.c
>> new file mode 100644
>> index 00000000..87ec8ebf
>> --- /dev/null
>> +++ b/src/process/aarch64/vfork.c
>> @@ -0,0 +1,21 @@
>> +#include <sys/types.h>
>> +
>> +#include "syscall.h"
>> +
>> +pid_t vfork(void)
>> +{
>> + /* aarch64 Linux syscall: x8 = nr, x0..x5 = args, ret in x0 */
>> + register long x8 __asm__("x8") = 220; /* SYS_clone */
>> + register long x0 __asm__("x0") = 0x4111; /* SIGCHLD | CLONE_VM | CLONE_VFORK */
>> + register long x1 __asm__("x1") = 0; /* arg2 = 0 */
>> +
>> + __asm__ volatile (
>> + "svc 0\n\t"
>> + ".hidden __syscall_ret\n\t"
>> + "b __syscall_ret\n\t"
>> + : "+r"(x0) /* x0 = in/out */
>> + : "r"(x1), "r"(x8) /* inputs */
>> + : "memory", "cc"
>> + );
>> + __builtin_unreachable();
>> +}
>
> This is incompatible with building with -fstack-protector-all, isn't it?
>
Yeah if the compiler emits the prologue with the canary, the tail call
to __syscall_ret would break that.
> Thanks,
> Florian
>
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2025-12-18 8:34 ` Bill Roberts
@ 2026-01-07 15:06 ` Rich Felker
2026-01-12 16:44 ` Bill Roberts
0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2026-01-07 15:06 UTC (permalink / raw)
To: Bill Roberts; +Cc: musl
On Thu, Dec 18, 2025 at 02:34:42AM -0600, Bill Roberts wrote:
> Sorry for the delay folks, holidays and such. Happy 2026!
>
> On 12/8/25 1:10 PM, Rich Felker wrote:
> > On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
> > > Based on previous discussions on enabling PAC and BTI for Aarch64
> > > targets, rather than annotating the existing assembler, use inline
> > > assembly and a mix of C. This has the benefits of:
> > > 1. Handling PAC, BTI and GCS.
> > > a. prologue and epilogue insertion as needed.
> > > b. Adding GNU notes as needed.
> > > 2. Adding in the CFI statements as needed.
> > >
> > > I'd love to get feedback, thanks!
> > >
> > > Bill Roberts (14):
> > > aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
> > > aarch64: rewrite fenv routines in C using inline asm
> > > aarch64: rewrite vfork routine in C using inline asm
> > > aarch64: rewrite clone routine in C using inline asm
> > > aarch64: rewrite __syscall_cp_asm in C using inline asm
> > > aarch64: rewrite __unmapself in C using inline asm
> > > aarch64: rewrite tlsdesc routines in C using inline asm
> > > aarch64: rewrite __restore_rt routines in C using inline asm
> > > aarch64: rewrite longjmp routines in C using inline asm
> > > aarch64: rewrite setjmp routines in C using inline asm
> > > aarch64: rewrite sigsetjmp routines in C using inline asm
> > > aarch64: rewrite dlsym routine in C using inline asm
> > > aarch64: rewrite memcpy routine in C using inline asm
> > > aarch64: rewrite memset routine in C using inline asm
> >
> > Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
> > and dlsym are fundamentally wrong in that they have to be asm entry
> > points. Wrapping them in C breaks the state they need to receive.
>
> I went through the generated code and ran tests against all of this, and
> it didn't break. Is there some specific case or compiler option where
> this explodes? What state exactly gets trashed?
Generated code? That's not how this works. The code has to be
semantically correct, not happen to produce machine code that's
correct when compiled with the compiler you tested it with.
Rich
* Re: [musl] [RFC 00/14] aarch64: Convert to inline asm
2026-01-07 15:06 ` Rich Felker
@ 2026-01-12 16:44 ` Bill Roberts
0 siblings, 0 replies; 30+ messages in thread
From: Bill Roberts @ 2026-01-12 16:44 UTC (permalink / raw)
To: musl
On 1/7/26 9:06 AM, Rich Felker wrote:
> On Thu, Dec 18, 2025 at 02:34:42AM -0600, Bill Roberts wrote:
>> Sorry for the delay folks, holidays and such. Happy 2026!
>>
>> On 12/8/25 1:10 PM, Rich Felker wrote:
>>> On Mon, Dec 08, 2025 at 11:44:43AM -0600, Bill Roberts wrote:
>>>> Based on previous discussions on enabling PAC and BTI for Aarch64
>>>> targets, rather than annotating the existing assembler, use inline
>>>> assembly and a mix of C. This has the benefits of:
>>>> 1. Handling PAC, BTI and GCS.
>>>> a. prologue and epilogue insertion as needed.
>>>> b. Adding GNU notes as needed.
>>>> 2. Adding in the CFI statements as needed.
>>>>
>>>> I'd love to get feedback, thanks!
>>>>
>>>> Bill Roberts (14):
>>>> aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
>>>> aarch64: rewrite fenv routines in C using inline asm
>>>> aarch64: rewrite vfork routine in C using inline asm
>>>> aarch64: rewrite clone routine in C using inline asm
>>>> aarch64: rewrite __syscall_cp_asm in C using inline asm
>>>> aarch64: rewrite __unmapself in C using inline asm
>>>> aarch64: rewrite tlsdesc routines in C using inline asm
>>>> aarch64: rewrite __restore_rt routines in C using inline asm
>>>> aarch64: rewrite longjmp routines in C using inline asm
>>>> aarch64: rewrite setjmp routines in C using inline asm
>>>> aarch64: rewrite sigsetjmp routines in C using inline asm
>>>> aarch64: rewrite dlsym routine in C using inline asm
>>>> aarch64: rewrite memcpy routine in C using inline asm
>>>> aarch64: rewrite memset routine in C using inline asm
>>>
>>> Of these, at least vfork, tlsdesc, __restore_rt, setjmp, sigsetjmp,
>>> and dlsym are fundamentally wrong in that they have to be asm entry
>>> points. Wrapping them in C breaks the state they need to receive.
>>
>> I went through the generated code and ran tests against all of this and it
>> didn't break, is there some specific case or some compiler option
>> where this explodes? What state exactly gets trashed?
>
> Generated code? That's not how this works. The code has to be
> semantically correct, not happen to produce machine code that's
> correct when compiled with the compiler you tested it with.
Yes, and we *agree* there. I see my fallacy with vfork: in my head I was
thinking fork semantics, which is not the case (duh).
For anyone curious, in vfork the child shares the address space with
the parent, so it has to be very careful about modifying state, which C
can't guarantee. For instance, the compiler could spill to the stack.
So with that said, do you want these patches now while I re-spin?
- aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI
- aarch64: rewrite fenv routines in C using inline asm
>
> Rich
end of thread, other threads:[~2026-01-12 16:44 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-08 17:44 [musl] [RFC 00/14] aarch64: Convert to inline asm Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 01/14] aarch64: drop crt(i|n).s since NO_LEGACY_INITFINI Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 02/14] aarch64: rewrite fenv routines in C using inline asm Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 03/14] aarch64: rewrite vfork routine " Bill Roberts
2025-12-11 12:09 ` Florian Weimer
2025-12-12 2:34 ` Rich Felker
2025-12-18 10:33 ` Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 04/14] aarch64: rewrite clone " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 05/14] aarch64: rewrite __syscall_cp_asm " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 06/14] aarch64: rewrite __unmapself " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 07/14] aarch64: rewrite tlsdesc routines " Bill Roberts
2025-12-11 12:10 ` Florian Weimer
2025-12-08 17:44 ` [musl] [RFC 08/14] aarch64: rewrite __restore_rt routines " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 09/14] aarch64: rewrite longjmp " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 10/14] aarch64: rewrite setjmp " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 11/14] aarch64: rewrite sigsetjmp " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 12/14] aarch64: rewrite dlsym routine " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 13/14] aarch64: rewrite memcpy " Bill Roberts
2025-12-08 17:44 ` [musl] [RFC 14/14] aarch64: rewrite memset " Bill Roberts
2025-12-08 19:10 ` [musl] [RFC 00/14] aarch64: Convert to " Rich Felker
2025-12-14 2:22 ` Demi Marie Obenour
2025-12-14 15:18 ` Rich Felker
2025-12-14 19:11 ` Demi Marie Obenour
2025-12-14 20:08 ` Rich Felker
2025-12-14 20:52 ` Demi Marie Obenour
2025-12-14 20:06 ` James Y Knight
2025-12-15 22:18 ` Demi Marie Obenour
2025-12-18 8:34 ` Bill Roberts
2026-01-07 15:06 ` Rich Felker
2026-01-12 16:44 ` Bill Roberts
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/