* [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-10-23 15:53 Pincheng Wang
2025-10-23 15:53 ` [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset Pincheng Wang
0 siblings, 1 reply; 3+ messages in thread
From: Pincheng Wang @ 2025-10-23 15:53 UTC (permalink / raw)
To: musl; +Cc: pincheng.plct
Hi all,
This is v2 of the RISC-V Vector (RVV) optimized memset patch. I'm
resending it because I forgot to commit the latest changes to Git. Sorry
for the inconvenience!
Changes from v1:
- Replaced compile-time detection (__riscv_vector macro) with runtime
detection using AT_HWCAP, addressing the main concern from v1
feedback.
- Introduced a dispatch mechanism (memset_dispatch.c) that selects
appropriate implementation at process startup.
- Added arch.mak configuration to prevent GCC auto-vectorization on
other string functions, ensuring only our runtime-detected code uses
vector instructions.
- Single binary now works correctly on both RVV and non-RVV hardware
when built with CFLAGS+="-march=rv64gcv".
Implementation details:
- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
- memset_dispatch.c exports memset() which dispatches via function
pointer.
- __init_riscv_string_optimizations() is called in __libc_start_main to
initialize the function pointer based on AT_HWCAP.
- The vector implementation uses vsetvli for bulk fills and a head-tail
strategy for small sizes.
Performance (unchanged from v1):
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
larger sizes.
For very small sizes (<8 bytes), there can be minor regressions compared
to the generic C version. This is a trade-off for the significant gains
on larger sizes.
Additional Notes:
It was noted during internal discussions that, according to the XuanTie
C908 user manual [1], vector memory operations must not target addresses
with the Strong Order (SO) attribute, or the system may crash. In the
current implementation this scenario is handled by OpenSBI fallback
mechanisms, but that path performs worse than the scalar
implementation.
I reviewed other existing vectorized mem* patches and did not find
explicit handling for this case. Introducing explicit attribute checks
would likely add extra overhead. Therefore, I am currently uncertain
whether special handling for XuanTie CPUs should be included in this
patch or addressed separately.
Testing:
- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
- Spacemit X60 with V extension support.
- XuanTie C908 with V extension support.
- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
Functional behavior matches generic memset.
Thanks,
Pincheng Wang
Pincheng Wang (1):
riscv64: add runtime-detected vector optimized memset
0000-cover-letter.patch | 79 ++++++
0000-cover-letter.patch.bak | 79 ++++++
...ime-detected-vector-optimized-memset.patch | 259 ++++++++++++++++++
arch/riscv64/arch.mak | 12 +
src/env/__libc_start_main.c | 3 +
src/internal/libc.h | 1 +
src/string/riscv64/memset.S | 134 +++++++++
src/string/riscv64/memset_dispatch.c | 38 +++
test_memset | Bin 0 -> 8824 bytes
9 files changed, 605 insertions(+)
create mode 100644 0000-cover-letter.patch
create mode 100644 0000-cover-letter.patch.bak
create mode 100644 0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
create mode 100644 arch/riscv64/arch.mak
create mode 100644 src/string/riscv64/memset.S
create mode 100644 src/string/riscv64/memset_dispatch.c
create mode 100755 test_memset
--
2.39.5
* [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset
2025-10-23 15:53 [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
@ 2025-10-23 15:53 ` Pincheng Wang
0 siblings, 0 replies; 3+ messages in thread
From: Pincheng Wang @ 2025-10-23 15:53 UTC (permalink / raw)
To: musl; +Cc: pincheng.plct
Add a RISC-V vector extension optimized memset implementation with
runtime CPU capability detection via AT_HWCAP.
The implementation provides both vector and scalar variants in a single
binary. At process startup, __init_riscv_string_optimizations() queries
AT_HWCAP to detect RVV support and selects the appropriate
implementation via function pointer dispatch. This allows the same libc
to run correctly on both vector-capable and non-vector RISC-V CPUs.
The vector implementation uses vsetvli for dynamic vector length and
employs a head-tail filling strategy for small sizes to minimize
overhead.
To prevent illegal instruction errors, arch.mak disables compiler
auto-vectorization globally except for memset.S, ensuring only the
runtime-detected code uses vector instructions.
Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
---
0000-cover-letter.patch | 79 ++++++
0000-cover-letter.patch.bak | 79 ++++++
...ime-detected-vector-optimized-memset.patch | 259 ++++++++++++++++++
arch/riscv64/arch.mak | 12 +
src/env/__libc_start_main.c | 3 +
src/internal/libc.h | 1 +
src/string/riscv64/memset.S | 134 +++++++++
src/string/riscv64/memset_dispatch.c | 38 +++
test_memset | Bin 0 -> 8824 bytes
9 files changed, 605 insertions(+)
create mode 100644 0000-cover-letter.patch
create mode 100644 0000-cover-letter.patch.bak
create mode 100644 0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
create mode 100644 arch/riscv64/arch.mak
create mode 100644 src/string/riscv64/memset.S
create mode 100644 src/string/riscv64/memset_dispatch.c
create mode 100755 test_memset
diff --git a/0000-cover-letter.patch b/0000-cover-letter.patch
new file mode 100644
index 00000000..83ebe09e
--- /dev/null
+++ b/0000-cover-letter.patch
@@ -0,0 +1,79 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:28 +0800
+Subject: [PATCH v2 0/1] riscv64: Add RVV optimized memset implementation
+
+Hi all,
+
+This is v2 of the RISC-V Vector (RVV) optimized memset patch.
+
+Changes from v1:
+- Replaced compile-time detection (__riscv_vector macro) with runtime
+ detection using AT_HWCAP, addressing the main concern from v1
+ feedback.
+- Introduced a dispatch mechanism (memset_dispatch.c) that selects
+ appropriate implementation at process startup.
+- Added arch.mak configuration to prevent GCC auto-vectorization on
+ other string functions, ensuring only our runtime-detected code uses
+ vector instructions.
+- Single binary now works correctly on both RVV and non-RVV hardware
+ when built with CFLAGS+="-march=rv64gcv".
+
+Implementation details:
+- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
+- memset_dispatch.c exports memset() which dispatches via function
+ pointer.
+- __init_riscv_string_optimizations() is called in __libc_start_main to
+ initialize the function pointer based on AT_HWCAP.
+- The vector implementation uses vsetvli for bulk fills and a head-tail
+ strategy for small sizes.
+
+Performance (unchanged from v1):
+- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
+ across medium and large sizes.
+- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
+ larger sizes.
+
+For very small sizes (<8 bytes), there can be minor regressions compared
+to the generic C version. This is a trade-off for the significant gains
+on larger sizes.
+
+Additional Notes:
+It was mentioned during internal discussions that, according to the XuanTie C908 user
+manuals [1], vector memory operations must not target addresses with
+Strong Order (SO) attributes, otherwise the system may crash. In the
+current implementation, this scenario is handled by OpenSBI fallback
+mechanisms, but this leads to degraded performance compared to scalar
+implementations.
+I reviewed other existing vectorized mem* patches and did not find
+explicit handling for this case. Introducing explicit attribute checks
+would likely add extra overhead. Therefore, I am currently uncertain
+whether special handling for XuanTie CPUs should be included in this
+patch or addressed separately.
+
+Testing:
+- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
+- Spacemit X60 with V extension support.
+- XuanTie C908 with V extension support.
+- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
+Functional behavior matches generic memset.
+
+Thanks,
+Pincheng Wang
+
+Pincheng Wang (1):
+ riscv64: add runtime-detected vector optimized memset
+
+ arch/riscv64/arch.mak | 12 +++
+ src/env/__libc_start_main.c | 3 +
+ src/internal/libc.h | 1 +
+ src/string/riscv64/memset.S | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c | 32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+--
+2.39.5
+
diff --git a/0000-cover-letter.patch.bak b/0000-cover-letter.patch.bak
new file mode 100644
index 00000000..83ebe09e
--- /dev/null
+++ b/0000-cover-letter.patch.bak
@@ -0,0 +1,79 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:28 +0800
+Subject: [PATCH v2 0/1] riscv64: Add RVV optimized memset implementation
+
+Hi all,
+
+This is v2 of the RISC-V Vector (RVV) optimized memset patch.
+
+Changes from v1:
+- Replaced compile-time detection (__riscv_vector macro) with runtime
+ detection using AT_HWCAP, addressing the main concern from v1
+ feedback.
+- Introduced a dispatch mechanism (memset_dispatch.c) that selects
+ appropriate implementation at process startup.
+- Added arch.mak configuration to prevent GCC auto-vectorization on
+ other string functions, ensuring only our runtime-detected code uses
+ vector instructions.
+- Single binary now works correctly on both RVV and non-RVV hardware
+ when built with CFLAGS+="-march=rv64gcv".
+
+Implementation details:
+- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
+- memset_dispatch.c exports memset() which dispatches via function
+ pointer.
+- __init_riscv_string_optimizations() is called in __libc_start_main to
+ initialize the function pointer based on AT_HWCAP.
+- The vector implementation uses vsetvli for bulk fills and a head-tail
+ strategy for small sizes.
+
+Performance (unchanged from v1):
+- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
+ across medium and large sizes.
+- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
+ larger sizes.
+
+For very small sizes (<8 bytes), there can be minor regressions compared
+to the generic C version. This is a trade-off for the significant gains
+on larger sizes.
+
+Additional Notes:
+It was mentioned during internal discussions that, according to the XuanTie C908 user
+manuals [1], vector memory operations must not target addresses with
+Strong Order (SO) attributes, otherwise the system may crash. In the
+current implementation, this scenario is handled by OpenSBI fallback
+mechanisms, but this leads to degraded performance compared to scalar
+implementations.
+I reviewed other existing vectorized mem* patches and did not find
+explicit handling for this case. Introducing explicit attribute checks
+would likely add extra overhead. Therefore, I am currently uncertain
+whether special handling for XuanTie CPUs should be included in this
+patch or addressed separately.
+
+Testing:
+- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
+- Spacemit X60 with V extension support.
+- XuanTie C908 with V extension support.
+- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
+Functional behavior matches generic memset.
+
+Thanks,
+Pincheng Wang
+
+Pincheng Wang (1):
+ riscv64: add runtime-detected vector optimized memset
+
+ arch/riscv64/arch.mak | 12 +++
+ src/env/__libc_start_main.c | 3 +
+ src/internal/libc.h | 1 +
+ src/string/riscv64/memset.S | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c | 32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+--
+2.39.5
+
diff --git a/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch b/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
new file mode 100644
index 00000000..c0596073
--- /dev/null
+++ b/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
@@ -0,0 +1,259 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:00 +0800
+Subject: [PATCH v2 1/1] riscv64: add runtime-detected vector optimized memset
+
+Add a RISC-V vector extension optimized memset implementation with
+runtime CPU capability detection via AT_HWCAP.
+
+The implementation provides both vector and scalar variants in a single
+binary. At process startup, __init_riscv_string_optimizations() queries
+AT_HWCAP to detect RVV support and selects the appropriate
+implementation via function pointer dispatch. This allows the same libc
+to run correctly on both vector-capable and non-vector RISC-V CPUs.
+
+The vector implementation uses vsetvli for dynamic vector length and
+employs a head-tail filling strategy for small sizes to minimize
+overhead.
+
+To prevent illegal instruction errors, arch.mak disables compiler
+auto-vectorization globally except for memset.S, ensuring only the
+runtime-detected code uses vector instructions.
+
+Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
+---
+ arch/riscv64/arch.mak | 12 +++
+ src/env/__libc_start_main.c | 3 +
+ src/internal/libc.h | 1 +
+ src/string/riscv64/memset.S | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c | 32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
+new file mode 100644
+index 00000000..5978eb0a
+--- /dev/null
++++ b/arch/riscv64/arch.mak
+@@ -0,0 +1,12 @@
++# Disable tree vectorization for all files except memset.S
++
++# Reason: We have hand-optimized vector memset.S that uses runtime detection
++# to switch between scalar and vector implementations based on CPU capability.
++# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
++# which would cause illegal instruction errors on CPUs without vector extensions.
++
++# Therefore, we disable auto-vectorization for all files except memset.S,
++# ensuring only our runtime-detected vector code uses vector instructions.
++
++# Add -fno-tree-vectorize to all object files except memset.S
++$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
+diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
+index c5b277bd..c23e63f4 100644
+--- a/src/env/__libc_start_main.c
++++ b/src/env/__libc_start_main.c
+@@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
+
+ __init_tls(aux);
+ __init_ssp((void *)aux[AT_RANDOM]);
++#ifdef __riscv
++ __init_riscv_string_optimizations();
++#endif
+
+ if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
+ && !aux[AT_SECURE]) return;
+diff --git a/src/internal/libc.h b/src/internal/libc.h
+index 619bba86..28e893a1 100644
+--- a/src/internal/libc.h
++++ b/src/internal/libc.h
+@@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
+ hidden void __init_libc(char **, char *);
+ hidden void __init_tls(size_t *);
+ hidden void __init_ssp(void *);
++hidden void __init_riscv_string_optimizations(void);
+ hidden void __libc_start_init(void);
+ hidden void __funcs_on_exit(void);
+ hidden void __funcs_on_quick_exit(void);
+diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
+new file mode 100644
+index 00000000..8568f32b
+--- /dev/null
++++ b/src/string/riscv64/memset.S
+@@ -0,0 +1,134 @@
++#ifdef __riscv_vector
++
++ .text
++ .global memset_vect
++/* void *memset_vect(void *s, int c, size_t n)
++ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
++ * Returns a0.
++ */
++memset_vect:
++ mv t0, a0 /* running dst; keep a0 as return */
++ beqz a2, .Ldone_vect /* n == 0 then return */
++
++ li t3, 8
++ bltu a2, t3, .Lsmall_vect /* small-size fast path */
++
++ /* Broadcast fill byte once. */
++ vsetvli t1, zero, e8, m8, ta, ma
++ vmv.v.x v0, a1
++
++.Lbulk_vect:
++ vsetvli t1, a2, e8, m8, ta, ma /* t1 = vl (bytes) */
++ vse8.v v0, (t0)
++ add t0, t0, t1
++ sub a2, a2, t1
++ bnez a2, .Lbulk_vect
++ j .Ldone_vect
++
++/* Small-size fast path (< 8).
++ * Head-tail fills to minimize branches and avoid vsetvli overhead.
++ */
++.Lsmall_vect:
++ /* Fill s[0], s[n-1] */
++ sb a1, 0(t0)
++ add t2, t0, a2
++ sb a1, -1(t2)
++ li t3, 2
++ bleu a2, t3, .Ldone_vect
++
++ /* Fill s[1], s[2], s[n-2], s[n-3] */
++ sb a1, 1(t0)
++ sb a1, 2(t0)
++ sb a1, -2(t2)
++ sb a1, -3(t2)
++ li t3, 6
++ bleu a2, t3, .Ldone_vect
++
++ /* Fill s[3], s[n-4] */
++ sb a1, 3(t0)
++ sb a1, -4(t2)
++ /* fallthrough for n <= 8 */
++
++.Ldone_vect:
++ ret
++.size memset_vect, .-memset_vect
++
++ .text
++ .global memset_scalar
++memset_scalar:
++ mv t0, a0
++ beqz a2, .Ldone_scalar
++
++ andi a1, a1, 0xff
++
++ sb a1, 0(t0)
++ add t2, t0, a2
++ sb a1, -1(t2)
++ li t3, 2
++ bleu a2, t3, .Ldone_scalar
++
++ sb a1, 1(t0)
++ sb a1, 2(t0)
++ sb a1, -2(t2)
++ sb a1, -3(t2)
++ li t3, 6
++ bleu a2, t3, .Ldone_scalar
++
++ sb a1, 3(t0)
++ sb a1, -4(t2)
++ li t3, 8
++ bleu a2, t3, .Ldone_scalar
++
++ addi t4, t0, 4
++ addi t5, t2, -4
++.Lloop_scalar:
++ bgeu t4, t5, .Ldone_scalar
++ sb a1, 0(t4)
++ addi t4, t4, 1
++ j .Lloop_scalar
++
++.Ldone_scalar:
++ ret
++.size memset_scalar, .-memset_scalar
++
++#else
++
++ .text
++ .global memset_scalar
++memset_scalar:
++ mv t0, a0
++ beqz a2, .Ldone_scalar
++
++ andi a1, a1, 0xff
++
++ sb a1, 0(t0)
++ add t2, t0, a2
++ sb a1, -1(t2)
++ li t3, 2
++ bleu a2, t3, .Ldone_scalar
++
++ sb a1, 1(t0)
++ sb a1, 2(t0)
++ sb a1, -2(t2)
++ sb a1, -3(t2)
++ li t3, 6
++ bleu a2, t3, .Ldone_scalar
++
++ sb a1, 3(t0)
++ sb a1, -4(t2)
++ li t3, 8
++ bleu a2, t3, .Ldone_scalar
++
++ addi t4, t0, 4
++ addi t5, t2, -4
++.Lloop_scalar:
++ bgeu t4, t5, .Ldone_scalar
++ sb a1, 0(t4)
++ addi t4, t4, 1
++ j .Lloop_scalar
++
++.Ldone_scalar:
++ ret
++.size memset_scalar, .-memset_scalar
++
++#endif
+diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
+new file mode 100644
+index 00000000..c8648177
+--- /dev/null
++++ b/src/string/riscv64/memset_dispatch.c
+@@ -0,0 +1,32 @@
++#include "libc.h"
++#include <stddef.h>
++#include <stdint.h>
++#include <sys/auxv.h>
++
++void *memset_scalar(void *s, int c, size_t n);
++void *memset_vect(void *s, int c, size_t n);
++
++/* Use scalar implementation by default */
++__attribute__((visibility("hidden")))
++void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
++
++void *memset(void *s, int c, size_t n)
++{
++ return __memset_ptr(s, c, n);
++}
++
++static int __has_rvv_via_hwcap(void)
++{
++ const unsigned long V_bit = (1ul << ('V' - 'A'));
++ unsigned long hwcap = getauxval(AT_HWCAP);
++ return (hwcap & V_bit) != 0;
++}
++
++__attribute__((visibility("hidden")))
++void __init_riscv_string_optimizations(void)
++{
++ if (__has_rvv_via_hwcap())
++ __memset_ptr = memset_vect;
++ else
++ __memset_ptr = memset_scalar;
++}
+--
+2.39.5
+
diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
new file mode 100644
index 00000000..5978eb0a
--- /dev/null
+++ b/arch/riscv64/arch.mak
@@ -0,0 +1,12 @@
+# Disable tree vectorization for all files except memset.S
+
+# Reason: We have hand-optimized vector memset.S that uses runtime detection
+# to switch between scalar and vector implementations based on CPU capability.
+# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
+# which would cause illegal instruction errors on CPUs without vector extensions.
+
+# Therefore, we disable auto-vectorization for all files except memset.S,
+# ensuring only our runtime-detected vector code uses vector instructions.
+
+# Add -fno-tree-vectorize to all object files except memset.S
+$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index c5b277bd..c23e63f4 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
__init_tls(aux);
__init_ssp((void *)aux[AT_RANDOM]);
+#ifdef __riscv
+ __init_riscv_string_optimizations();
+#endif
if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
&& !aux[AT_SECURE]) return;
diff --git a/src/internal/libc.h b/src/internal/libc.h
index 619bba86..28e893a1 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
hidden void __init_libc(char **, char *);
hidden void __init_tls(size_t *);
hidden void __init_ssp(void *);
+hidden void __init_riscv_string_optimizations(void);
hidden void __libc_start_init(void);
hidden void __funcs_on_exit(void);
hidden void __funcs_on_quick_exit(void);
diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 00000000..8568f32b
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,134 @@
+#ifdef __riscv_vector
+
+ .text
+ .global memset_vect
+/* void *memset_vect(void *s, int c, size_t n)
+ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
+ * Returns a0.
+ */
+memset_vect:
+ mv t0, a0 /* running dst; keep a0 as return */
+ beqz a2, .Ldone_vect /* n == 0 then return */
+
+ li t3, 8
+ bltu a2, t3, .Lsmall_vect /* small-size fast path */
+
+ /* Broadcast fill byte once. */
+ vsetvli t1, zero, e8, m8, ta, ma
+ vmv.v.x v0, a1
+
+.Lbulk_vect:
+ vsetvli t1, a2, e8, m8, ta, ma /* t1 = vl (bytes) */
+ vse8.v v0, (t0)
+ add t0, t0, t1
+ sub a2, a2, t1
+ bnez a2, .Lbulk_vect
+ j .Ldone_vect
+
+/* Small-size fast path (< 8).
+ * Head-tail fills to minimize branches and avoid vsetvli overhead.
+ */
+.Lsmall_vect:
+ /* Fill s[0], s[n-1] */
+ sb a1, 0(t0)
+ add t2, t0, a2
+ sb a1, -1(t2)
+ li t3, 2
+ bleu a2, t3, .Ldone_vect
+
+ /* Fill s[1], s[2], s[n-2], s[n-3] */
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(t2)
+ sb a1, -3(t2)
+ li t3, 6
+ bleu a2, t3, .Ldone_vect
+
+ /* Fill s[3], s[n-4] */
+ sb a1, 3(t0)
+ sb a1, -4(t2)
+ /* fallthrough for n <= 8 */
+
+.Ldone_vect:
+ ret
+.size memset_vect, .-memset_vect
+
+ .text
+ .global memset_scalar
+memset_scalar:
+ mv t0, a0
+ beqz a2, .Ldone_scalar
+
+ andi a1, a1, 0xff
+
+ sb a1, 0(t0)
+ add t2, t0, a2
+ sb a1, -1(t2)
+ li t3, 2
+ bleu a2, t3, .Ldone_scalar
+
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(t2)
+ sb a1, -3(t2)
+ li t3, 6
+ bleu a2, t3, .Ldone_scalar
+
+ sb a1, 3(t0)
+ sb a1, -4(t2)
+ li t3, 8
+ bleu a2, t3, .Ldone_scalar
+
+ addi t4, t0, 4
+ addi t5, t2, -4
+.Lloop_scalar:
+ bgeu t4, t5, .Ldone_scalar
+ sb a1, 0(t4)
+ addi t4, t4, 1
+ j .Lloop_scalar
+
+.Ldone_scalar:
+ ret
+.size memset_scalar, .-memset_scalar
+
+#else
+
+ .text
+ .global memset_scalar
+memset_scalar:
+ mv t0, a0
+ beqz a2, .Ldone_scalar
+
+ andi a1, a1, 0xff
+
+ sb a1, 0(t0)
+ add t2, t0, a2
+ sb a1, -1(t2)
+ li t3, 2
+ bleu a2, t3, .Ldone_scalar
+
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(t2)
+ sb a1, -3(t2)
+ li t3, 6
+ bleu a2, t3, .Ldone_scalar
+
+ sb a1, 3(t0)
+ sb a1, -4(t2)
+ li t3, 8
+ bleu a2, t3, .Ldone_scalar
+
+ addi t4, t0, 4
+ addi t5, t2, -4
+.Lloop_scalar:
+ bgeu t4, t5, .Ldone_scalar
+ sb a1, 0(t4)
+ addi t4, t4, 1
+ j .Lloop_scalar
+
+.Ldone_scalar:
+ ret
+.size memset_scalar, .-memset_scalar
+
+#endif
diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
new file mode 100644
index 00000000..aadf19fb
--- /dev/null
+++ b/src/string/riscv64/memset_dispatch.c
@@ -0,0 +1,38 @@
+#include "libc.h"
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/auxv.h>
+
+void *memset_scalar(void *s, int c, size_t n);
+#ifdef __riscv_vector
+void *memset_vect(void *s, int c, size_t n);
+#endif
+
+/* Use scalar implementation by default */
+__attribute__((visibility("hidden")))
+void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
+
+void *memset(void *s, int c, size_t n)
+{
+ return __memset_ptr(s, c, n);
+}
+
+static int __has_rvv_via_hwcap(void)
+{
+ const unsigned long V_bit = (1ul << ('V' - 'A'));
+ unsigned long hwcap = getauxval(AT_HWCAP);
+ return (hwcap & V_bit) != 0;
+}
+
+__attribute__((visibility("hidden")))
+void __init_riscv_string_optimizations(void)
+{
+#ifdef __riscv_vector
+ if (__has_rvv_via_hwcap())
+ __memset_ptr = memset_vect;
+ else
+ __memset_ptr = memset_scalar;
+#else
+ __memset_ptr = memset_scalar;
+#endif
+}
diff --git a/test_memset b/test_memset
new file mode 100755
index 0000000000000000000000000000000000000000..5fe4bef7c7b1ebaf407fbc502df4f3ea312b0150
GIT binary patch
literal 8824
[base85-encoded binary patch data omitted]
--
2.39.5
* [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-10-23 16:06 Pincheng Wang
0 siblings, 0 replies; 3+ messages in thread
From: Pincheng Wang @ 2025-10-23 16:06 UTC (permalink / raw)
To: musl; +Cc: pincheng.plct
Hi all,
This is v2 of the RISC-V Vector (RVV) optimized memset patch. I am
resending it because I forgot to commit the latest changes to Git. I am
sorry for cluttering the mailing list; please treat this version as the
correct one.
Changes from v1:
- Replaced compile-time detection (__riscv_vector macro) with runtime
detection using AT_HWCAP, addressing the main concern from v1
feedback.
- Introduced a dispatch mechanism (memset_dispatch.c) that selects
appropriate implementation at process startup.
- Added arch.mak configuration to prevent GCC auto-vectorization on
other string functions, ensuring only our runtime-detected code uses
vector instructions.
- Single binary now works correctly on both RVV and non-RVV hardware
when built with CFLAGS+="-march=rv64gcv".
Implementation details:
- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
- memset_dispatch.c exports memset() which dispatches via function
pointer.
- __init_riscv_string_optimizations() is called in __libc_start_main to
initialize the function pointer based on AT_HWCAP.
- The vector implementation uses vsetvli for bulk fills and a head-tail
strategy for small sizes.
Performance (unchanged from v1):
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
larger sizes.
For very small sizes (<8 bytes), there can be minor regressions compared
to the generic C version. This is a trade-off for the significant gains
on larger sizes.
Additional Notes:
It was noted during internal discussions that, according to the XuanTie
C908 user manual [1], vector memory operations must not target addresses
with the Strong Order (SO) attribute, or the system may crash. In the
current implementation this scenario is handled by OpenSBI fallback
mechanisms, but that path performs worse than the scalar
implementation.
I reviewed other existing vectorized mem* patches and did not find
explicit handling for this case. Introducing explicit attribute checks
would likely add extra overhead. Therefore, I am currently uncertain
whether special handling for XuanTie CPUs should be included in this
patch or addressed separately.
Testing:
- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
- Spacemit X60 with V extension support.
- XuanTie C908 with V extension support.
- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
Functional behavior matches generic memset.
Thanks,
Pincheng Wang
[1] https://occ-oss-prod.oss-cn-hangzhou.aliyuncs.com/resource//1752575352271/%E7%8E%84%E9%93%81C908R1S0%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C%28xrvm%29_Rev.21_20250715.pdf (in Chinese only; the relevant content is in Chapter 8).
Pincheng Wang (1):
riscv64: add runtime-detected vector optimized memset
arch/riscv64/arch.mak | 12 +++
src/env/__libc_start_main.c | 3 +
src/internal/libc.h | 1 +
src/string/riscv64/memset.S | 134 +++++++++++++++++++++++++++
src/string/riscv64/memset_dispatch.c | 38 ++++++++
5 files changed, 188 insertions(+)
create mode 100644 arch/riscv64/arch.mak
create mode 100644 src/string/riscv64/memset.S
create mode 100644 src/string/riscv64/memset_dispatch.c
--
2.39.5