mailing list of musl libc
* [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-10-23 15:53 Pincheng Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Pincheng Wang @ 2025-10-23 15:53 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

Hi all,

This is v2 of the RISC-V Vector (RVV) optimized memset patch. I'm
resending it because I forgot to commit the latest changes to Git. Sorry
for the inconvenience!

Changes from v1:
- Replaced compile-time detection (__riscv_vector macro) with runtime
  detection using AT_HWCAP, addressing the main concern from v1
  feedback.
- Introduced a dispatch mechanism (memset_dispatch.c) that selects the
  appropriate implementation at process startup.
- Added arch.mak configuration to prevent GCC auto-vectorization on
  other string functions, ensuring only our runtime-detected code uses
  vector instructions.
- Single binary now works correctly on both RVV and non-RVV hardware
  when built with CFLAGS+="-march=rv64gcv".

Implementation details:
- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
- memset_dispatch.c exports memset() which dispatches via function
  pointer.
- __init_riscv_string_optimizations() is called in __libc_start_main to
  initialize the function pointer based on AT_HWCAP.
- The vector implementation uses vsetvli for bulk fills and a head-tail
  strategy for small sizes.

Performance (unchanged from v1):
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
  across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
  larger sizes.

For very small sizes (<8 bytes), there can be minor regressions compared
to the generic C version. This is a trade-off for the significant gains
at larger sizes.

Additional Notes:
It was mentioned in internal discussions that, according to the XuanTie
C908 user manual [1], vector memory operations must not target addresses
with the Strong Order (SO) attribute; otherwise the system may crash. In the
current implementation, this scenario is handled by OpenSBI fallback
mechanisms, but this leads to degraded performance compared to scalar
implementations.
I reviewed other existing vectorized mem* patches and did not find
explicit handling for this case. Introducing explicit attribute checks
would likely add extra overhead. Therefore, I am currently uncertain
whether special handling for XuanTie CPUs should be included in this
patch or addressed separately.

Testing:
- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
- Spacemit X60 with V extension support.
- XuanTie C908 with V extension support.
- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
In all configurations, functional behavior matches the generic memset.

Thanks,
Pincheng Wang

Pincheng Wang (1):
  riscv64: add runtime-detected vector optimized memset

 0000-cover-letter.patch                       |  79 ++++++
 0000-cover-letter.patch.bak                   |  79 ++++++
 ...ime-detected-vector-optimized-memset.patch | 259 ++++++++++++++++++
 arch/riscv64/arch.mak                         |  12 +
 src/env/__libc_start_main.c                   |   3 +
 src/internal/libc.h                           |   1 +
 src/string/riscv64/memset.S                   | 134 +++++++++
 src/string/riscv64/memset_dispatch.c          |  38 +++
 test_memset                                   | Bin 0 -> 8824 bytes
 9 files changed, 605 insertions(+)
 create mode 100644 0000-cover-letter.patch
 create mode 100644 0000-cover-letter.patch.bak
 create mode 100644 0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
 create mode 100644 arch/riscv64/arch.mak
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memset_dispatch.c
 create mode 100755 test_memset

-- 
2.39.5


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-10-23 16:06 Pincheng Wang
  2025-10-23 16:06 ` [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset Pincheng Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Pincheng Wang @ 2025-10-23 16:06 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

Hi all,

This is v2 of the RISC-V Vector (RVV) optimized memset patch. I'm
resending it because I forgot to commit the latest changes to Git. Sorry
for the noise on the mailing list; please treat this version as the
correct one.

Changes from v1:
- Replaced compile-time detection (__riscv_vector macro) with runtime
  detection using AT_HWCAP, addressing the main concern from v1
  feedback.
- Introduced a dispatch mechanism (memset_dispatch.c) that selects the
  appropriate implementation at process startup.
- Added arch.mak configuration to prevent GCC auto-vectorization on
  other string functions, ensuring only our runtime-detected code uses
  vector instructions.
- Single binary now works correctly on both RVV and non-RVV hardware
  when built with CFLAGS+="-march=rv64gcv".

Implementation details:
- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
- memset_dispatch.c exports memset() which dispatches via function
  pointer.
- __init_riscv_string_optimizations() is called in __libc_start_main to
  initialize the function pointer based on AT_HWCAP.
- The vector implementation uses vsetvli for bulk fills and a head-tail
  strategy for small sizes.

Performance (unchanged from v1):
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
  across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
  larger sizes.

For very small sizes (<8 bytes), there can be minor regressions compared
to the generic C version. This is a trade-off for the significant gains
at larger sizes.

Additional Notes:
It was mentioned in internal discussions that, according to the XuanTie
C908 user manual [1], vector memory operations must not target addresses
with the Strong Order (SO) attribute; otherwise the system may crash. In the
current implementation, this scenario is handled by OpenSBI fallback
mechanisms, but this leads to degraded performance compared to scalar
implementations.
I reviewed other existing vectorized mem* patches and did not find
explicit handling for this case. Introducing explicit attribute checks
would likely add extra overhead. Therefore, I am currently uncertain
whether special handling for XuanTie CPUs should be included in this
patch or addressed separately.

Testing:
- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
- Spacemit X60 with V extension support.
- XuanTie C908 with V extension support.
- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
In all configurations, functional behavior matches the generic memset.

Thanks,
Pincheng Wang

[1] https://occ-oss-prod.oss-cn-hangzhou.aliyuncs.com/resource//1752575352271/%E7%8E%84%E9%93%81C908R1S0%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C%28xrvm%29_Rev.21_20250715.pdf (In Chinese only; the relevant content is in Chapter 8.)

Pincheng Wang (1):
  riscv64: add runtime-detected vector optimized memset

 arch/riscv64/arch.mak                |  12 +++
 src/env/__libc_start_main.c          |   3 +
 src/internal/libc.h                  |   1 +
 src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
 src/string/riscv64/memset_dispatch.c |  38 ++++++++
 5 files changed, 188 insertions(+)
 create mode 100644 arch/riscv64/arch.mak
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memset_dispatch.c

-- 
2.39.5


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset
  2025-10-23 16:06 [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
@ 2025-10-23 16:06 ` Pincheng Wang
  2025-10-23 20:56   ` Stefan O'Rear
  0 siblings, 1 reply; 5+ messages in thread
From: Pincheng Wang @ 2025-10-23 16:06 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

Add a RISC-V vector extension optimized memset implementation with
runtime CPU capability detection via AT_HWCAP.

The implementation provides both vector and scalar variants in a single
binary. At process startup, __init_riscv_string_optimizations() queries
AT_HWCAP to detect RVV support and selects the appropriate
implementation via function pointer dispatch. This allows the same libc
to run correctly on both vector-capable and non-vector RISC-V CPUs.

The vector implementation uses vsetvli for dynamic vector length and
employs a head-tail filling strategy for small sizes to minimize
overhead.

To prevent illegal instruction errors, arch.mak disables compiler
auto-vectorization globally except for memset.S, ensuring only the
runtime-detected code uses vector instructions.

Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
---
 arch/riscv64/arch.mak                |  12 +++
 src/env/__libc_start_main.c          |   3 +
 src/internal/libc.h                  |   1 +
 src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
 src/string/riscv64/memset_dispatch.c |  38 ++++++++
 5 files changed, 188 insertions(+)
 create mode 100644 arch/riscv64/arch.mak
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memset_dispatch.c

diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
new file mode 100644
index 00000000..5978eb0a
--- /dev/null
+++ b/arch/riscv64/arch.mak
@@ -0,0 +1,12 @@
+# Disable tree vectorization for all files except memset.S
+
+# Reason: We have hand-optimized vector memset.S that uses runtime detection
+# to switch between scalar and vector implementations based on CPU capability.
+# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
+# which would cause illegal instruction errors on CPUs without vector extensions.
+
+# Therefore, we disable auto-vectorization for all files except memset.S,
+# ensuring only our runtime-detected vector code uses vector instructions.
+
+# Add -fno-tree-vectorize to all object files except memset.S
+$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index c5b277bd..c23e63f4 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
 
 	__init_tls(aux);
 	__init_ssp((void *)aux[AT_RANDOM]);
+#ifdef __riscv
+	__init_riscv_string_optimizations();
+#endif
 
 	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
 		&& !aux[AT_SECURE]) return;
diff --git a/src/internal/libc.h b/src/internal/libc.h
index 619bba86..28e893a1 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
 hidden void __init_libc(char **, char *);
 hidden void __init_tls(size_t *);
 hidden void __init_ssp(void *);
+hidden void __init_riscv_string_optimizations(void);
 hidden void __libc_start_init(void);
 hidden void __funcs_on_exit(void);
 hidden void __funcs_on_quick_exit(void);
diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 00000000..8568f32b
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,134 @@
+#ifdef __riscv_vector
+
+    .text
+    .global memset_vect
+/* void *memset_vect(void *s, int c, size_t n)
+ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
+ * Returns a0.
+ */
+memset_vect:
+    mv      t0, a0                    /* running dst; keep a0 as return */
+    beqz    a2, .Ldone_vect           /* n == 0 then return */
+
+    li      t3, 8
+    bltu    a2, t3, .Lsmall_vect      /* small-size fast path */
+
+    /* Broadcast fill byte once. */
+    vsetvli t1, zero, e8, m8, ta, ma
+    vmv.v.x v0, a1
+
+.Lbulk_vect:
+    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
+    vse8.v  v0, (t0)
+    add     t0, t0, t1
+    sub     a2, a2, t1
+    bnez    a2, .Lbulk_vect
+    j       .Ldone_vect
+
+/* Small-size fast path (< 8).
+ * Head-tail fills to minimize branches and avoid vsetvli overhead.
+ */
+.Lsmall_vect:
+    /* Fill s[0], s[n-1] */
+    sb      a1, 0(t0)
+    add     t2, t0, a2
+    sb      a1, -1(t2)
+    li      t3, 2
+    bleu    a2, t3, .Ldone_vect
+
+    /* Fill s[1], s[2], s[n-2], s[n-3] */
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_vect
+
+    /* Fill s[3], s[n-4] */
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    /* fallthrough for n <= 8 */
+
+.Ldone_vect:
+    ret
+.size memset_vect, .-memset_vect
+
+    .text
+    .global memset_scalar
+memset_scalar:
+    mv      t0, a0                    
+    beqz    a2, .Ldone_scalar
+
+    andi    a1, a1, 0xff              
+
+    sb      a1, 0(t0)                 
+    add     t2, t0, a2
+    sb      a1, -1(t2)                
+    li      t3, 2
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    li      t3, 8
+    bleu    a2, t3, .Ldone_scalar
+
+    addi    t4, t0, 4
+    addi    t5, t2, -4
+.Lloop_scalar:
+    bgeu    t4, t5, .Ldone_scalar
+    sb      a1, 0(t4)
+    addi    t4, t4, 1
+    j       .Lloop_scalar
+
+.Ldone_scalar:
+    ret
+.size memset_scalar, .-memset_scalar
+
+#else
+
+    .text
+    .global memset_scalar
+memset_scalar:
+    mv      t0, a0                    
+    beqz    a2, .Ldone_scalar
+
+    andi    a1, a1, 0xff              
+
+    sb      a1, 0(t0)                 
+    add     t2, t0, a2
+    sb      a1, -1(t2)                
+    li      t3, 2
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    li      t3, 8
+    bleu    a2, t3, .Ldone_scalar
+
+    addi    t4, t0, 4
+    addi    t5, t2, -4
+.Lloop_scalar:
+    bgeu    t4, t5, .Ldone_scalar
+    sb      a1, 0(t4)
+    addi    t4, t4, 1
+    j       .Lloop_scalar
+
+.Ldone_scalar:
+    ret
+.size memset_scalar, .-memset_scalar
+
+#endif
diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
new file mode 100644
index 00000000..aadf19fb
--- /dev/null
+++ b/src/string/riscv64/memset_dispatch.c
@@ -0,0 +1,38 @@
+#include "libc.h"
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/auxv.h>
+
+void *memset_scalar(void *s, int c, size_t n);
+#ifdef __riscv_vector
+void *memset_vect(void *s, int c, size_t n);
+#endif
+
+/* Use scalar implementation by default */
+__attribute__((visibility("hidden")))
+void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
+
+void *memset(void *s, int c, size_t n)
+{
+	return __memset_ptr(s, c, n);
+}
+
+static int __has_rvv_via_hwcap(void)
+{
+	const unsigned long V_bit = (1ul << ('V' - 'A'));
+	unsigned long hwcap = getauxval(AT_HWCAP);
+	return (hwcap & V_bit) != 0;
+}
+
+__attribute__((visibility("hidden")))
+void __init_riscv_string_optimizations(void)
+{
+#ifdef __riscv_vector
+	if (__has_rvv_via_hwcap())
+		__memset_ptr = memset_vect;
+	else
+		__memset_ptr = memset_scalar;
+#else
+	__memset_ptr = memset_scalar;
+#endif
+}
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset
  2025-10-23 16:06 ` [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset Pincheng Wang
@ 2025-10-23 20:56   ` Stefan O'Rear
  2025-10-24  0:50     ` Pincheng Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan O'Rear @ 2025-10-23 20:56 UTC (permalink / raw)
  To: musl

On Thu, Oct 23, 2025, at 12:06 PM, Pincheng Wang wrote:
> Add a RISC-V vector extension optimized memset implementation with
> runtime CPU capability detection via AT_HWCAP.
>
> The implementation provides both vector and scalar variants in a single
> binary. At process startup, __init_riscv_string_optimizations() queries
> AT_HWCAP to detect RVV support and selects the appropriate
> implementation via function pointer dispatch. This allows the same libc
> to run correctly on both vector-capable and non-vector RISC-V CPUs.
>
> The vector implementation uses vsetvli for dynamic vector length and
> employs a head-tail filling strategy for small sizes to minimize
> overhead.
>
> To prevent illegal instruction errors, arch.mak disables compiler
> auto-vectorization globally except for memset.S, ensuring only the
> runtime-detected code uses vector instructions.
>
> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
> ---
>  arch/riscv64/arch.mak                |  12 +++
>  src/env/__libc_start_main.c          |   3 +
>  src/internal/libc.h                  |   1 +
>  src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
>  src/string/riscv64/memset_dispatch.c |  38 ++++++++
>  5 files changed, 188 insertions(+)
>  create mode 100644 arch/riscv64/arch.mak
>  create mode 100644 src/string/riscv64/memset.S
>  create mode 100644 src/string/riscv64/memset_dispatch.c
>
> diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
> new file mode 100644
> index 00000000..5978eb0a
> --- /dev/null
> +++ b/arch/riscv64/arch.mak
> @@ -0,0 +1,12 @@
> +# Disable tree vectorization for all files except memset.S
> +
> +# Reason: We have hand-optimized vector memset.S that uses runtime detection
> +# to switch between scalar and vector implementations based on CPU capability.
> +# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
> +# which would cause illegal instruction errors on CPUs without vector extensions.
> +
> +# Therefore, we disable auto-vectorization for all files except memset.S,
> +# ensuring only our runtime-detected vector code uses vector instructions.
> +
> +# Add -fno-tree-vectorize to all object files except memset.S
> +$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize

This isn't sufficient to prevent gcc from generating vector instructions
for e.g. struct zeroing idioms.

It's also the wrong approach because it prevents people who _do_ know at
compile time that their target hardware has V from using it pervasively.

To be consistent with anything else being built with the same options,
-march in CFLAGS should be the minimum set of extensions that are known
at compile time to be available, not including any that must be detected
at runtime.  Then runtime-detected extensions can be made available within
contexts guarded by the runtime detection; this is fairly easy for assembly.

> diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
> index c5b277bd..c23e63f4 100644
> --- a/src/env/__libc_start_main.c
> +++ b/src/env/__libc_start_main.c
> @@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
> 
>  	__init_tls(aux);
>  	__init_ssp((void *)aux[AT_RANDOM]);
> +#ifdef __riscv
> +	__init_riscv_string_optimizations();
> +#endif
> 
>  	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
>  		&& !aux[AT_SECURE]) return;
> diff --git a/src/internal/libc.h b/src/internal/libc.h
> index 619bba86..28e893a1 100644
> --- a/src/internal/libc.h
> +++ b/src/internal/libc.h
> @@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
>  hidden void __init_libc(char **, char *);
>  hidden void __init_tls(size_t *);
>  hidden void __init_ssp(void *);
> +hidden void __init_riscv_string_optimizations(void);
>  hidden void __libc_start_init(void);
>  hidden void __funcs_on_exit(void);
>  hidden void __funcs_on_quick_exit(void);
> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
> new file mode 100644
> index 00000000..8568f32b
> --- /dev/null
> +++ b/src/string/riscv64/memset.S
> @@ -0,0 +1,134 @@
> +#ifdef __riscv_vector

unconditional

> +
> +    .text
> +    .global memset_vect
> +/* void *memset_vect(void *s, int c, size_t n)
> + * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
> + * Returns a0.
> + */

.option push
.option arch,+v

> +memset_vect:
> +    mv      t0, a0                    /* running dst; keep a0 as return */
> +    beqz    a2, .Ldone_vect           /* n == 0 then return */
> +
> +    li      t3, 8
> +    bltu    a2, t3, .Lsmall_vect      /* small-size fast path */
> +
> +    /* Broadcast fill byte once. */
> +    vsetvli t1, zero, e8, m8, ta, ma
> +    vmv.v.x v0, a1
> +
> +.Lbulk_vect:
> +    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
> +    vse8.v  v0, (t0)
> +    add     t0, t0, t1
> +    sub     a2, a2, t1
> +    bnez    a2, .Lbulk_vect
> +    j       .Ldone_vect
> +
> +/* Small-size fast path (< 8).
> + * Head-tail fills to minimize branches and avoid vsetvli overhead.
> + */
> +.Lsmall_vect:

Compilers will generate inline code for memset(s,c,n) where n is a small
constant.  The only reason for memset to actually be called with small n
is if n is variable. I suspect that on real code it will typically be
faster to use vector instructions for small memsets because of the
avoided branch mispredictions.

> +    /* Fill s[0], s[n-1] */
> +    sb      a1, 0(t0)
> +    add     t2, t0, a2
> +    sb      a1, -1(t2)
> +    li      t3, 2
> +    bleu    a2, t3, .Ldone_vect
> +
> +    /* Fill s[1], s[2], s[n-2], s[n-3] */
> +    sb      a1, 1(t0)
> +    sb      a1, 2(t0)
> +    sb      a1, -2(t2)
> +    sb      a1, -3(t2)
> +    li      t3, 6
> +    bleu    a2, t3, .Ldone_vect
> +
> +    /* Fill s[3], s[n-4] */
> +    sb      a1, 3(t0)
> +    sb      a1, -4(t2)
> +    /* fallthrough for n <= 8 */
> +
> +.Ldone_vect:
> +    ret
> +.size memset_vect, .-memset_vect

.option pop

> +    .text
> +    .global memset_scalar
> +memset_scalar:
> +    mv      t0, a0                    
> +    beqz    a2, .Ldone_scalar
> +
> +    andi    a1, a1, 0xff              
> +
> +    sb      a1, 0(t0)                 
> +    add     t2, t0, a2
> +    sb      a1, -1(t2)                
> +    li      t3, 2
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    sb      a1, 1(t0)
> +    sb      a1, 2(t0)
> +    sb      a1, -2(t2)
> +    sb      a1, -3(t2)
> +    li      t3, 6
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    sb      a1, 3(t0)
> +    sb      a1, -4(t2)
> +    li      t3, 8
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    addi    t4, t0, 4
> +    addi    t5, t2, -4
> +.Lloop_scalar:
> +    bgeu    t4, t5, .Ldone_scalar
> +    sb      a1, 0(t4)
> +    addi    t4, t4, 1
> +    j       .Lloop_scalar
> +
> +.Ldone_scalar:
> +    ret
> +.size memset_scalar, .-memset_scalar
> +
> +#else
> +
> +    .text
> +    .global memset_scalar
> +memset_scalar:
> +    mv      t0, a0                    
> +    beqz    a2, .Ldone_scalar
> +
> +    andi    a1, a1, 0xff              
> +
> +    sb      a1, 0(t0)                 
> +    add     t2, t0, a2
> +    sb      a1, -1(t2)                
> +    li      t3, 2
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    sb      a1, 1(t0)
> +    sb      a1, 2(t0)
> +    sb      a1, -2(t2)
> +    sb      a1, -3(t2)
> +    li      t3, 6
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    sb      a1, 3(t0)
> +    sb      a1, -4(t2)
> +    li      t3, 8
> +    bleu    a2, t3, .Ldone_scalar
> +
> +    addi    t4, t0, 4
> +    addi    t5, t2, -4
> +.Lloop_scalar:
> +    bgeu    t4, t5, .Ldone_scalar
> +    sb      a1, 0(t4)
> +    addi    t4, t4, 1
> +    j       .Lloop_scalar
> +
> +.Ldone_scalar:
> +    ret
> +.size memset_scalar, .-memset_scalar
> +
> +#endif
> diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
> new file mode 100644
> index 00000000..aadf19fb
> --- /dev/null
> +++ b/src/string/riscv64/memset_dispatch.c
> @@ -0,0 +1,38 @@
> +#include "libc.h"
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <sys/auxv.h>
> +
> +void *memset_scalar(void *s, int c, size_t n);
> +#ifdef __riscv_vector
> +void *memset_vect(void *s, int c, size_t n);
> +#endif
> +
> +/* Use scalar implementation by default */

Not really a default since it's always set by __init_libc.  It does
control the memset implementation used in dynlink.c prior to _start.

> +__attribute__((visibility("hidden")))
> +void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
> +
> +void *memset(void *s, int c, size_t n)
> +{
> +	return __memset_ptr(s, c, n);
> +}
> +
> +static int __has_rvv_via_hwcap(void)
> +{
> +	const unsigned long V_bit = (1ul << ('V' - 'A'));
> +	unsigned long hwcap = getauxval(AT_HWCAP);

getauxval is not a reserved identifier in C and it can be overridden by a
symbol from the main program with a different meaning. Use __getauxval.

You might want to rename __memset_scalar and __memset_vector for the same
reason, but mem* are "potentially reserved identifiers" so this isn't
strictly required.

> +	return (hwcap & V_bit) != 0;
> +}
> +
> +__attribute__((visibility("hidden")))
> +void __init_riscv_string_optimizations(void)
> +{
> +#ifdef __riscv_vector
> +	if (__has_rvv_via_hwcap())
> +		__memset_ptr = memset_vect;
> +	else
> +		__memset_ptr = memset_scalar;
> +#else
> +	__memset_ptr = memset_scalar;
> +#endif
> +}
> -- 
> 2.39.5

-s

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset
  2025-10-23 20:56   ` Stefan O'Rear
@ 2025-10-24  0:50     ` Pincheng Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Pincheng Wang @ 2025-10-24  0:50 UTC (permalink / raw)
  To: musl, Stefan O'Rear

On 2025/10/24 04:56, Stefan O'Rear wrote:
> On Thu, Oct 23, 2025, at 12:06 PM, Pincheng Wang wrote:
>> Add a RISC-V vector extension optimized memset implementation with
>> runtime CPU capability detection via AT_HWCAP.
>>
>> The implementation provides both vector and scalar variants in a single
>> binary. At process startup, __init_riscv_string_optimizations() queries
>> AT_HWCAP to detect RVV support and selects the appropriate
>> implementation via function pointer dispatch. This allows the same libc
>> to run correctly on both vector-capable and non-vector RISC-V CPUs.
>>
>> The vector implementation uses vsetvli for dynamic vector length and
>> employs a head-tail filling strategy for small sizes to minimize
>> overhead.
>>
>> To prevent illegal instruction errors, arch.mak disables compiler
>> auto-vectorization globally except for memset.S, ensuring only the
>> runtime-detected code uses vector instructions.
>>
>> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
>> ---
>>   arch/riscv64/arch.mak                |  12 +++
>>   src/env/__libc_start_main.c          |   3 +
>>   src/internal/libc.h                  |   1 +
>>   src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
>>   src/string/riscv64/memset_dispatch.c |  38 ++++++++
>>   5 files changed, 188 insertions(+)
>>   create mode 100644 arch/riscv64/arch.mak
>>   create mode 100644 src/string/riscv64/memset.S
>>   create mode 100644 src/string/riscv64/memset_dispatch.c
>>
>> diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
>> new file mode 100644
>> index 00000000..5978eb0a
>> --- /dev/null
>> +++ b/arch/riscv64/arch.mak
>> @@ -0,0 +1,12 @@
>> +# Disable tree vectorization for all files except memset.S
>> +
>> +# Reason: We have hand-optimized vector memset.S that uses runtime detection
>> +# to switch between scalar and vector implementations based on CPU capability.
>> +# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
>> +# which would cause illegal instruction errors on CPUs without vector extensions.
>> +
>> +# Therefore, we disable auto-vectorization for all files except memset.S,
>> +# ensuring only our runtime-detected vector code uses vector instructions.
>> +
>> +# Add -fno-tree-vectorize to all object files except memset.S
>> +$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
> 
> This isn't sufficient to prevent gcc from generating vector instructions
> for e.g. struct zeroing idioms.
> 
> It's also the wrong approach because it prevents people who _do_ know at
> compile time that their target hardware has V from using it pervasively.
> 
> To be consistent with anything else being built with the same options,
> -march in CFLAGS should be the minimum set of extensions that are known
> at compile time to be available, not including any that must be detected
> at runtime.  Then runtime-detected extensions can be made available within
> contexts guarded by the runtime detection; this is fairly easy for assembly.
> 

Thank you for the feedback. I agree that the current approach is rather
simplistic: it fails to address the issue completely, and it also blocks
potential optimizations for targets where vector extensions are known at
compile time.

In the next revision, I'll remove this workaround and follow a cleaner 
solution as you suggested: keeping `-march` minimal and enabling vector 
instructions only in runtime-guarded paths.

>> diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
>> index c5b277bd..c23e63f4 100644
>> --- a/src/env/__libc_start_main.c
>> +++ b/src/env/__libc_start_main.c
>> @@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
>>
>>   	__init_tls(aux);
>>   	__init_ssp((void *)aux[AT_RANDOM]);
>> +#ifdef __riscv
>> +	__init_riscv_string_optimizations();
>> +#endif
>>
>>   	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
>>   		&& !aux[AT_SECURE]) return;
>> diff --git a/src/internal/libc.h b/src/internal/libc.h
>> index 619bba86..28e893a1 100644
>> --- a/src/internal/libc.h
>> +++ b/src/internal/libc.h
>> @@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
>>   hidden void __init_libc(char **, char *);
>>   hidden void __init_tls(size_t *);
>>   hidden void __init_ssp(void *);
>> +hidden void __init_riscv_string_optimizations(void);
>>   hidden void __libc_start_init(void);
>>   hidden void __funcs_on_exit(void);
>>   hidden void __funcs_on_quick_exit(void);
>> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
>> new file mode 100644
>> index 00000000..8568f32b
>> --- /dev/null
>> +++ b/src/string/riscv64/memset.S
>> @@ -0,0 +1,134 @@
>> +#ifdef __riscv_vector
> 
> unconditional
> 
>> +
>> +    .text
>> +    .global memset_vect
>> +/* void *memset_vect(void *s, int c, size_t n)
>> + * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
>> + * Returns a0.
>> + */
> 
> .option push
> .option arch,+v
> 

Will fix these in the next revision, thanks.

>> +memset_vect:
>> +    mv      t0, a0                    /* running dst; keep a0 as return */
>> +    beqz    a2, .Ldone_vect           /* n == 0 then return */
>> +
>> +    li      t3, 8
>> +    bltu    a2, t3, .Lsmall_vect      /* small-size fast path */
>> +
>> +    /* Broadcast fill byte once. */
>> +    vsetvli t1, zero, e8, m8, ta, ma
>> +    vmv.v.x v0, a1
>> +
>> +.Lbulk_vect:
>> +    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
>> +    vse8.v  v0, (t0)
>> +    add     t0, t0, t1
>> +    sub     a2, a2, t1
>> +    bnez    a2, .Lbulk_vect
>> +    j       .Ldone_vect
>> +
>> +/* Small-size fast path (< 8).
>> + * Head-tail fills to minimize branches and avoid vsetvli overhead.
>> + */
>> +.Lsmall_vect:
> 
> Compilers will generate inline code for memset(s,c,n) where n is a small
> constant.  The only reason for memset to actually be called with small n
> is if n is variable. I suspect that on real code it will typically be
> faster to use vector instructions for small memsets because of the
> avoided branch mispredictions.
> 

The head-tail strategy was originally chosen because, on my test 
hardware, vector instructions performed worse than the generic C 
implementation for small sizes (<8 bytes). Even with the head-tail 
approach, the improvement in this range was minimal and still slightly 
behind the C implementation.

Considering the trade-offs in branch misprediction and code size, I 
agree that a pure vector implementation is likely more beneficial in 
practice. I'll revise the implementation accordingly in the next version.
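Concretely, dropping the small-size path and letting vsetvli clamp vl for every size would reduce the function to something like this (hypothetical v3 sketch, not yet benchmarked on hardware):

```asm
memset_vect:
    mv      t0, a0                    /* running dst; keep a0 as return */
    vsetvli t1, zero, e8, m8, ta, ma  /* vl = vlmax for the broadcast */
    vmv.v.x v0, a1                    /* SEW=8 uses only the low byte of a1 */
1:  vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl = min(vlmax, n) */
    beqz    t1, 2f                    /* n == 0: nothing left to store */
    vse8.v  v0, (t0)
    add     t0, t0, t1
    sub     a2, a2, t1
    bnez    a2, 1b
2:  ret
```

The n == 0 case falls out naturally since vsetvli with AVL of zero yields vl = 0, so no separate entry branch is needed.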

>> +    /* Fill s[0], s[n-1] */
>> +    sb      a1, 0(t0)
>> +    add     t2, t0, a2
>> +    sb      a1, -1(t2)
>> +    li      t3, 2
>> +    bleu    a2, t3, .Ldone_vect
>> +
>> +    /* Fill s[1], s[2], s[n-2], s[n-3] */
>> +    sb      a1, 1(t0)
>> +    sb      a1, 2(t0)
>> +    sb      a1, -2(t2)
>> +    sb      a1, -3(t2)
>> +    li      t3, 6
>> +    bleu    a2, t3, .Ldone_vect
>> +
>> +    /* Fill s[3], s[n-4] */
>> +    sb      a1, 3(t0)
>> +    sb      a1, -4(t2)
>> +    /* fallthrough for n < 8 */
>> +
>> +.Ldone_vect:
>> +    ret
>> +.size memset_vect, .-memset_vect
> 
> .option pop
> 

Will fix in the next revision, thanks.

>> +    .text
>> +    .global memset_scalar
>> +memset_scalar:
>> +    mv      t0, a0
>> +    beqz    a2, .Ldone_scalar
>> +
>> +    andi    a1, a1, 0xff
>> +
>> +    sb      a1, 0(t0)
>> +    add     t2, t0, a2
>> +    sb      a1, -1(t2)
>> +    li      t3, 2
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    sb      a1, 1(t0)
>> +    sb      a1, 2(t0)
>> +    sb      a1, -2(t2)
>> +    sb      a1, -3(t2)
>> +    li      t3, 6
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    sb      a1, 3(t0)
>> +    sb      a1, -4(t2)
>> +    li      t3, 8
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    addi    t4, t0, 4
>> +    addi    t5, t2, -4
>> +.Lloop_scalar:
>> +    bgeu    t4, t5, .Ldone_scalar
>> +    sb      a1, 0(t4)
>> +    addi    t4, t4, 1
>> +    j       .Lloop_scalar
>> +
>> +.Ldone_scalar:
>> +    ret
>> +.size memset_scalar, .-memset_scalar
>> +
>> +#else
>> +
>> +    .text
>> +    .global memset_scalar
>> +memset_scalar:
>> +    mv      t0, a0
>> +    beqz    a2, .Ldone_scalar
>> +
>> +    andi    a1, a1, 0xff
>> +
>> +    sb      a1, 0(t0)
>> +    add     t2, t0, a2
>> +    sb      a1, -1(t2)
>> +    li      t3, 2
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    sb      a1, 1(t0)
>> +    sb      a1, 2(t0)
>> +    sb      a1, -2(t2)
>> +    sb      a1, -3(t2)
>> +    li      t3, 6
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    sb      a1, 3(t0)
>> +    sb      a1, -4(t2)
>> +    li      t3, 8
>> +    bleu    a2, t3, .Ldone_scalar
>> +
>> +    addi    t4, t0, 4
>> +    addi    t5, t2, -4
>> +.Lloop_scalar:
>> +    bgeu    t4, t5, .Ldone_scalar
>> +    sb      a1, 0(t4)
>> +    addi    t4, t4, 1
>> +    j       .Lloop_scalar
>> +
>> +.Ldone_scalar:
>> +    ret
>> +.size memset_scalar, .-memset_scalar
>> +
>> +#endif
>> diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
>> new file mode 100644
>> index 00000000..aadf19fb
>> --- /dev/null
>> +++ b/src/string/riscv64/memset_dispatch.c
>> @@ -0,0 +1,38 @@
>> +#include "libc.h"
>> +#include <stddef.h>
>> +#include <stdint.h>
>> +#include <sys/auxv.h>
>> +
>> +void *memset_scalar(void *s, int c, size_t n);
>> +#ifdef __riscv_vector
>> +void *memset_vect(void *s, int c, size_t n);
>> +#endif
>> +
>> +/* Use scalar implementation by default */
> 
> Not really a default since it's always set by __init_libc.  It does
> control the memset implementation used in dynlink.c prior to _start.
> 

The comment here is indeed misleading. What I meant is that the scalar 
implementation serves as the fallback before AT_HWCAP is available (which 
also covers the calls from dynlink.c prior to _start, as you note), and 
the actual dispatch happens once HWCAP can be checked. I'll update the 
comment in the next version to make this clearer.

>> +__attribute__((visibility("hidden")))
>> +void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
>> +
>> +void *memset(void *s, int c, size_t n)
>> +{
>> +	return __memset_ptr(s, c, n);
>> +}
>> +
>> +static int __has_rvv_via_hwcap(void)
>> +{
>> +	const unsigned long V_bit = (1ul << ('V' - 'A'));
>> +	unsigned long hwcap = getauxval(AT_HWCAP);
> 
> getauxval is not a reserved identifier in C and it can be overridden by a
> symbol from the main program with a different meaning. Use __getauxval.
> 
> You might want to rename __memset_scalar and __memset_vector for the same
> reason, but mem* are "potentially reserved identifiers" so this isn't
> strictly required.
> 

Will switch to __getauxval and rename the memset_* symbols in the next 
revision to avoid potential symbol conflicts.

>> +	return (hwcap & V_bit) != 0;
>> +}
>> +
>> +__attribute__((visibility("hidden")))
>> +void __init_riscv_string_optimizations(void)
>> +{
>> +#ifdef __riscv_vector
>> +	if (__has_rvv_via_hwcap())
>> +		__memset_ptr = memset_vect;
>> +	else
>> +		__memset_ptr = memset_scalar;
>> +#else
>> +	__memset_ptr = memset_scalar;
>> +#endif
>> +}
>> -- 
>> 2.39.5
> 
> -s

Thanks again for the detailed review comments.

Best regards,
Pincheng Wang


Thread overview: 5+ messages
2025-10-23 16:06 [musl] [PATCH v2 resend 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
2025-10-23 16:06 ` [musl] [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset Pincheng Wang
2025-10-23 20:56   ` Stefan O'Rear
2025-10-24  0:50     ` Pincheng Wang

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
