mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-09-25 13:15 Pincheng Wang
  2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-25 13:15 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

Hi all,

This patch introduces a RISC-V Vector (RVV) optimized implementation of
memset.

Key points:
- Use RVV instructions to fill memory in bulk, with a small-size
  head-tail fast path to reduce vsetvli overhead.
- Fall back to a scalar head-tail implementation (like generic C
  implementation) when RVV is not available.
- Reduce both instruction count and code size: memset.o shrinks by about
  16.5% compared to the generic C build.

Performance results on RVV-capable hardware show clear improvements:
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
  across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
  larger sizes.

For very small sizes (<8 Bytes), there can be regressions compared to
the generic C version. A more aggresive fast path could remove these
regressions, but at the cost of added code complexity. Feedback on this
trade-off is welcome.

The implementation was tested under QEMU with RVV enabled and on real
hardware. Functional behavior matches the generic memset, with no
changes to the public interface.

Thanks,
Pincheng Wang

Pincheng Wang (1):
  riscv64: optimize memset implementation with vector extension

 src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 src/string/riscv64/memset.S

-- 
2.39.5


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
  2025-09-25 13:15 [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
@ 2025-09-25 13:15 ` Pincheng Wang
  2025-09-25 15:30   ` Yao Zi
  0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-25 13:15 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

Use head-tail filling strategy for small sizes and dynamic vsetvli
approach for vector loops to reduce branch overhead. Add conditional
compilation to fall back to scalar implementation when __riscv_vector is
not available.

Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
---
 src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 src/string/riscv64/memset.S

diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 00000000..5fc6ee14
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,101 @@
+#ifdef __riscv_vector
+
+    .text
+    .global memset
+/* void *memset(void *s, int c, size_t n)
+ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
+ * Returns a0.
+ */
+memset:
+    mv      t0, a0                    /* running dst; keep a0 as return */
+    beqz    a2, .Ldone                /* n == 0 → return */
+
+    li      t3, 8
+    bltu    a2, t3, .Lsmall           /* small-size fast path */
+
+    /* Broadcast fill byte once. */
+    vsetvli t1, zero, e8, m8, ta, ma
+    vmv.v.x v0, a1
+
+.Lbulk:
+    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
+    vse8.v  v0, (t0)
+    add     t0, t0, t1
+    sub     a2, a2, t1
+    bnez    a2, .Lbulk
+    j       .Ldone
+
+/* Small-size fast path (< 8).
+ * Head-tail fills to minimize branches and avoid vsetvli overhead.
+ */
+.Lsmall:
+    /* Fill s[0], s[n-1] */
+    sb      a1, 0(t0)
+    add     t2, t0, a2
+    sb      a1, -1(t2)
+    li      t3, 2
+    bleu    a2, t3, .Ldone
+
+    /* Fill s[1], s[2], s[n-2], s[n-3] */
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone
+
+    /* Fill s[3], s[n-4] */
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    /* fallthrough for n <= 8 */
+
+.Ldone:
+    ret
+.size memset, .-memset
+
+#else  /* !__riscv_vector */
+
+    .text
+    .global memset
+/* Fallback scalar memset
+ * void *memset(void *s, int c, size_t n)
+ */
+memset:
+    mv      t0, a0                    /* running dst; keep a0 as return */
+    beqz    a2, .Ldone
+
+    andi    a1, a1, 0xff              /* use low 8 bits only */
+
+    /* Head-tail strategy for small n */
+    sb      a1, 0(t0)                 /* s[0] */
+    add     t2, t0, a2
+    sb      a1, -1(t2)                /* s[n-1] */
+    li      t3, 2
+    bleu    a2, t3, .Ldone
+
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone
+
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    li      t3, 8
+    bleu    a2, t3, .Ldone
+
+    /* Linear fill middle region [4, n-4) */
+    addi    t4, t0, 4
+    addi    t5, t2, -4
+.Lloop:
+    bgeu    t4, t5, .Ldone
+    sb      a1, 0(t4)
+    addi    t4, t4, 1
+    j       .Lloop
+
+.Ldone:
+    ret
+.size memset, .-memset
+
+#endif
-- 
2.39.5


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
  2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang
@ 2025-09-25 15:30   ` Yao Zi
  2025-09-26  0:31     ` Pincheng Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Yao Zi @ 2025-09-25 15:30 UTC (permalink / raw)
  To: musl; +Cc: pincheng.plct

On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote:
> Use head-tail filling strategy for small sizes and dynamic vsetvli
> approach for vector loops to reduce branch overhead. Add conditional
> compilation to fall back to scalar implementation when __riscv_vector is
> not available.
> 
> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
> ---
>  src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
>  create mode 100644 src/string/riscv64/memset.S
> 
> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
> new file mode 100644
> index 00000000..5fc6ee14
> --- /dev/null
> +++ b/src/string/riscv64/memset.S
> @@ -0,0 +1,101 @@
> +#ifdef __riscv_vector

I don't think musl is built with V extension specified in march on
RISC-V platforms by default. Does this patch only benefit builds that
"-march=rv64gcv" is manually specified in CFLAGS?

Furthermore, having RVV available at compilation-time doesn't mean it's
available at runtime. This effectively raises the baseline for RISC-V
platforms from RV64GC (or even lower) to RV64GCV, where the latter isn't
implied by the mostly-adapted RVA20 profile.

Best regards,
Yao Zi

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
  2025-09-25 15:30   ` Yao Zi
@ 2025-09-26  0:31     ` Pincheng Wang
  2025-09-26  3:37       ` Markus Wichmann
  0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-26  0:31 UTC (permalink / raw)
  To: musl, Yao Zi

On 2025/9/25 23:30, Yao Zi wrote:
> On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote:
>> Use head-tail filling strategy for small sizes and dynamic vsetvli
>> approach for vector loops to reduce branch overhead. Add conditional
>> compilation to fall back to scalar implementation when __riscv_vector is
>> not available.
>>
>> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
>> ---
>>   src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 101 insertions(+)
>>   create mode 100644 src/string/riscv64/memset.S
>>
>> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
>> new file mode 100644
>> index 00000000..5fc6ee14
>> --- /dev/null
>> +++ b/src/string/riscv64/memset.S
>> @@ -0,0 +1,101 @@
>> +#ifdef __riscv_vector
> 
> I don't think musl is built with V extension specified in march on
> RISC-V platforms by default. Does this patch only benefit builds that
> "-march=rv64gcv" is manually specified in CFLAGS?
> 
> Furthermore, having RVV available at compilation-time doesn't mean it's
> available at runtime. This effectively raises the baseline for RISC-V
> platforms from RV64GC (or even lower) to RV64GCV, where the latter isn't
> implied by the mostly-adapted RVA20 profile.
> 
> Best regards,
> Yao Zi

Hi, Yao

Thank you for your review. This patch currently only takes effect when 
`-march=rv64gcv` is manually specified in CFLAGS. I also understand your 
concern about enabling the vector implementation purely through 
compile-time conditionals.

I am investigating a runtime detection and dispatch mechanism to select 
the appropriate implementation based on actual hardware support. If I 
make progress on this and verify it works as expected, I will update the 
approach in a v2 patch.

Best regards,
Pincheng Wang


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
  2025-09-26  0:31     ` Pincheng Wang
@ 2025-09-26  3:37       ` Markus Wichmann
  2025-09-26 11:21         ` Pincheng Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Wichmann @ 2025-09-26  3:37 UTC (permalink / raw)
  To: musl; +Cc: Yao Zi

Am Fri, Sep 26, 2025 at 08:31:53AM +0800 schrieb Pincheng Wang:
> I am investigating a runtime detection and dispatch mechanism to select the
> appropriate implementation based on actual hardware support. If I make
> progress on this and verify it works as expected, I will update the approach
> in a v2 patch.
> 

There seems to be a hwcap flag for ISA_V that might be usable. Not sure
if it is any help for memcpy() though, because memcpy() is a function
GCC can insert calls to at any point, including the dynamic linker
stages 1 and 2, and references to libc.hwcap are invalid there. The only
solution I see is to explicitly switch to the optimized version after it
is possible to do so and use the generic version for starters. But that
would be a bit of a larger change to the code base.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
  2025-09-26  3:37       ` Markus Wichmann
@ 2025-09-26 11:21         ` Pincheng Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Pincheng Wang @ 2025-09-26 11:21 UTC (permalink / raw)
  To: musl, Markus Wichmann; +Cc: Yao Zi

On 2025/9/26 11:37, Markus Wichmann wrote:
> Am Fri, Sep 26, 2025 at 08:31:53AM +0800 schrieb Pincheng Wang:
>> I am investigating a runtime detection and dispatch mechanism to select the
>> appropriate implementation based on actual hardware support. If I make
>> progress on this and verify it works as expected, I will update the approach
>> in a v2 patch.
>>
> 
> There seems to be a hwcap flag for ISA_V that might be usable. Not sure
> if it is any help for memcpy() though, because memcpy() is a function
> GCC can insert calls to at any point, including the dynamic linker
> stages 1 and 2, and references to libc.hwcap are invalid there. The only
> solution I see is to explicitly switch to the optimized version after it
> is possible to do so and use the generic version for starters. But that
> would be a bit of a larger change to the code base.
> 
> Ciao,
> Markus

Hi, Markus

Thanks for your suggestion! I do plan to use hwcap to detect the V 
extension, similar to implementations on other platforms. Your point 
about hwcap being unavailable during the linker's bootstrap stage is 
indeed very helpful - I hadn't fully considered that. I'll look into 
adopting your suggestion: using a generic implementation initially and 
explicitly switching to the optimized version once hwcap becomes 
available. However, I can't yet be certain about the exact 
implementation details, as this will require some time for coding and 
testing.

Moreover, since I haven't found any existing RISC-V examples of such 
detection-and-dispatch mechanisms, and given that this approach would 
indeed entail significant changes to the dynamic linker, I plan to send 
the v2 patch as an RFC after I finish my current work.

Thank you again for your insightful comments and suggestions!

Best regards,
Pincheng Wang


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-09-26 11:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-25 13:15 [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang
2025-09-25 15:30   ` Yao Zi
2025-09-26  0:31     ` Pincheng Wang
2025-09-26  3:37       ` Markus Wichmann
2025-09-26 11:21         ` Pincheng Wang

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).