* [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation
@ 2025-09-25 13:15 Pincheng Wang
2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang
0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-25 13:15 UTC
To: musl; +Cc: pincheng.plct
Hi all,
This patch introduces a RISC-V Vector (RVV) optimized implementation of
memset.
Key points:
- Use RVV instructions to fill memory in bulk, with a small-size
head-tail fast path to reduce vsetvli overhead.
- Fall back to a scalar head-tail implementation (like generic C
implementation) when RVV is not available.
- Reduce both instruction count and code size: memset.o shrinks by about
16.5% compared to the generic C build.
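The control flow described in these key points can be modelled in portable C. This is only a sketch under stated assumptions, not the code from the patch: MODEL_VLMAX is a hypothetical stand-in for the vector length that vsetvli would return, the inner byte loop stands in for a single vse8.v store, and the names model_memset and fill_small are made up for illustration. Real hardware may also pick a vl other than min(n, VLMAX) when n lies between VLMAX and 2*VLMAX, so the loop below is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hardware vector length in bytes
 * (the vl that vsetvli would return); not a value from the patch. */
#define MODEL_VLMAX 16

/* Head-tail fill for 1 <= n < 8: overlapping stores from both ends
 * cover every byte with only two size comparisons. */
static void fill_small(unsigned char *s, unsigned char c, size_t n)
{
	assert(n >= 1 && n < 8);
	s[0] = c;
	s[n-1] = c;
	if (n <= 2) return;
	s[1] = c;
	s[2] = c;
	s[n-2] = c;
	s[n-3] = c;
	if (n <= 6) return;
	s[3] = c;
	s[n-4] = c;
}

/* Scalar model of the control flow: small sizes take the head-tail
 * path, everything else goes through a strip-mined bulk loop. */
void *model_memset(void *dest, int c, size_t n)
{
	unsigned char *s = dest;

	if (n == 0) return dest;
	if (n < 8) {
		fill_small(s, (unsigned char)c, n);
		return dest;
	}
	while (n) {
		/* Simplified model of "vsetvli t1, a2, ...". */
		size_t vl = n < MODEL_VLMAX ? n : MODEL_VLMAX;
		/* Models one vse8.v store of vl bytes. */
		for (size_t i = 0; i < vl; i++) s[i] = (unsigned char)c;
		s += vl;
		n -= vl;
	}
	return dest;
}
```

Because the stores in fill_small overlap, sizes 1 through 7 are handled without a per-byte loop; the cost is that some bytes are written twice.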
Performance results on RVV-capable hardware show clear improvements:
- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
across medium and large sizes.
- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
larger sizes.
For very small sizes (<8 Bytes), there can be regressions compared to
the generic C version. A more aggressive fast path could remove these
regressions, but at the cost of added code complexity. Feedback on this
trade-off is welcome.
The implementation was tested under QEMU with RVV enabled and on real
hardware. Functional behavior matches the generic memset, with no
changes to the public interface.
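A check of the kind described here (behavior matching the generic memset) could look roughly like the following differential test. This is a sketch, not the harness actually used for the patch: check_memset, PAD and CAP are hypothetical names, and the sizes and fill values are arbitrary choices. It compares a candidate against a plain byte-loop reference for a range of sizes and start offsets, and uses guard bytes on both sides to catch out-of-bounds writes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* Returns the number of failing (offset, size) cases; 0 means the
 * candidate matched the reference everywhere that was tried. */
size_t check_memset(memset_fn fn, size_t max_n)
{
	enum { PAD = 16, CAP = 512 };
	unsigned char got[CAP], want[CAP];
	size_t failures = 0;

	/* Largest index touched is PAD + (PAD-1) + max_n - 1. */
	assert(max_n + 2 * PAD <= CAP);

	for (size_t off = 0; off < PAD; off++) {
		for (size_t n = 0; n <= max_n; n++) {
			/* Same guard pattern in both buffers. */
			for (size_t i = 0; i < CAP; i++) got[i] = want[i] = 0xEE;
			/* Reference result: byte loop on the expected buffer. */
			for (size_t i = 0; i < n; i++) want[PAD + off + i] = 0x5C;
			/* 0x15C checks that only the low 8 bits of c are used. */
			void *ret = fn(got + PAD + off, 0x15C, n);
			if (ret != got + PAD + off || memcmp(got, want, CAP) != 0)
				failures++;
		}
	}
	return failures;
}
```

Calling check_memset(memset, 300) against the libc under test covers the small-size path, the bulk path and every start alignment from 0 to 15.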
Thanks,
Pincheng Wang
Pincheng Wang (1):
riscv64: optimize memset implementation with vector extension
src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)
create mode 100644 src/string/riscv64/memset.S
--
2.39.5
* [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
2025-09-25 13:15 [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang
@ 2025-09-25 13:15 ` Pincheng Wang
2025-09-25 15:30 ` Yao Zi
0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-25 13:15 UTC
To: musl; +Cc: pincheng.plct
Use head-tail filling strategy for small sizes and dynamic vsetvli
approach for vector loops to reduce branch overhead. Add conditional
compilation to fall back to scalar implementation when __riscv_vector is
not available.
Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
---
src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)
create mode 100644 src/string/riscv64/memset.S
diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 00000000..5fc6ee14
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,101 @@
+#ifdef __riscv_vector
+
+ .text
+ .global memset
+/* void *memset(void *s, int c, size_t n)
+ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
+ * Returns a0.
+ */
+memset:
+ mv t0, a0 /* running dst; keep a0 as return */
+ beqz a2, .Ldone /* n == 0 → return */
+
+ li t3, 8
+ bltu a2, t3, .Lsmall /* small-size fast path */
+
+ /* Broadcast fill byte once. */
+ vsetvli t1, zero, e8, m8, ta, ma
+ vmv.v.x v0, a1
+
+.Lbulk:
+ vsetvli t1, a2, e8, m8, ta, ma /* t1 = vl (bytes) */
+ vse8.v v0, (t0)
+ add t0, t0, t1
+ sub a2, a2, t1
+ bnez a2, .Lbulk
+ j .Ldone
+
+/* Small-size fast path (< 8).
+ * Head-tail fills to minimize branches and avoid vsetvli overhead.
+ */
+.Lsmall:
+ /* Fill s[0], s[n-1] */
+ sb a1, 0(t0)
+ add t2, t0, a2
+ sb a1, -1(t2)
+ li t3, 2
+ bleu a2, t3, .Ldone
+
+ /* Fill s[1], s[2], s[n-2], s[n-3] */
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(t2)
+ sb a1, -3(t2)
+ li t3, 6
+ bleu a2, t3, .Ldone
+
+ /* Fill s[3], s[n-4] */
+ sb a1, 3(t0)
+ sb a1, -4(t2)
+ /* only n == 7 reaches here; fall through to .Ldone */
+
+.Ldone:
+ ret
+.size memset, .-memset
+
+#else /* !__riscv_vector */
+
+ .text
+ .global memset
+/* Fallback scalar memset
+ * void *memset(void *s, int c, size_t n)
+ */
+memset:
+ mv t0, a0 /* running dst; keep a0 as return */
+ beqz a2, .Ldone
+
+ andi a1, a1, 0xff /* use low 8 bits only */
+
+ /* Head-tail strategy for small n */
+ sb a1, 0(t0) /* s[0] */
+ add t2, t0, a2
+ sb a1, -1(t2) /* s[n-1] */
+ li t3, 2
+ bleu a2, t3, .Ldone
+
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(t2)
+ sb a1, -3(t2)
+ li t3, 6
+ bleu a2, t3, .Ldone
+
+ sb a1, 3(t0)
+ sb a1, -4(t2)
+ li t3, 8
+ bleu a2, t3, .Ldone
+
+ /* Linear fill middle region [4, n-4) */
+ addi t4, t0, 4
+ addi t5, t2, -4
+.Lloop:
+ bgeu t4, t5, .Ldone
+ sb a1, 0(t4)
+ addi t4, t4, 1
+ j .Lloop
+
+.Ldone:
+ ret
+.size memset, .-memset
+
+#endif
--
2.39.5
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang
@ 2025-09-25 15:30 ` Yao Zi
2025-09-26 0:31 ` Pincheng Wang
0 siblings, 1 reply; 6+ messages in thread
From: Yao Zi @ 2025-09-25 15:30 UTC
To: musl; +Cc: pincheng.plct
On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote:
> Use head-tail filling strategy for small sizes and dynamic vsetvli
> approach for vector loops to reduce branch overhead. Add conditional
> compilation to fall back to scalar implementation when __riscv_vector is
> not available.
>
> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
> ---
> src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 101 insertions(+)
> create mode 100644 src/string/riscv64/memset.S
>
> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
> new file mode 100644
> index 00000000..5fc6ee14
> --- /dev/null
> +++ b/src/string/riscv64/memset.S
> @@ -0,0 +1,101 @@
> +#ifdef __riscv_vector
I don't think musl is built with the V extension specified in -march on
RISC-V platforms by default. Does this patch only benefit builds where
"-march=rv64gcv" is manually specified in CFLAGS?
Furthermore, having RVV available at compile time doesn't mean it's
available at runtime. This effectively raises the baseline for RISC-V
platforms from RV64GC (or even lower) to RV64GCV, and the latter isn't
implied by the most widely adopted profile, RVA20.
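The difference between build-time and runtime availability can be sketched in C. The following is an illustration under stated assumptions rather than anything from the patch: the function names are hypothetical, and the runtime check relies on the Linux convention of reporting single-letter RISC-V extensions in AT_HWCAP as one bit per letter, so 'V' is bit 'V' - 'A'.

```c
#include <stdbool.h>

#if defined(__riscv) && defined(__linux__)
#include <sys/auxv.h>
#endif

/* One bit per single-letter extension, 'A' being bit 0 (Linux
 * convention for RISC-V AT_HWCAP). */
#define HWCAP_ISA_BIT(letter) (1UL << ((letter) - 'A'))

/* Pure helper: does this hwcap word advertise the V extension? */
bool hwcap_has_v(unsigned long hwcap)
{
	return (hwcap & HWCAP_ISA_BIT('V')) != 0;
}

/* Build-time answer: true only if the compiler was told to target V,
 * for example with -march=rv64gcv. Says nothing about the CPU the
 * binary eventually runs on. */
bool built_with_v(void)
{
#ifdef __riscv_vector
	return true;
#else
	return false;
#endif
}

/* Runtime answer: asks the kernel what the running CPU supports. */
bool cpu_has_v(void)
{
#if defined(__riscv) && defined(__linux__)
	return hwcap_has_v(getauxval(AT_HWCAP));
#else
	return false; /* the bit layout above is RISC-V specific */
#endif
}
```

A binary for which built_with_v() is true but cpu_has_v() is false would hit an illegal instruction on its first vector opcode, which is the baseline concern raised here.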
Best regards,
Yao Zi
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
2025-09-25 15:30 ` Yao Zi
@ 2025-09-26 0:31 ` Pincheng Wang
2025-09-26 3:37 ` Markus Wichmann
0 siblings, 1 reply; 6+ messages in thread
From: Pincheng Wang @ 2025-09-26 0:31 UTC
To: musl, Yao Zi
On 2025/9/25 23:30, Yao Zi wrote:
> On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote:
>> Use head-tail filling strategy for small sizes and dynamic vsetvli
>> approach for vector loops to reduce branch overhead. Add conditional
>> compilation to fall back to scalar implementation when __riscv_vector is
>> not available.
>>
>> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
>> ---
>> src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 101 insertions(+)
>> create mode 100644 src/string/riscv64/memset.S
>>
>> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
>> new file mode 100644
>> index 00000000..5fc6ee14
>> --- /dev/null
>> +++ b/src/string/riscv64/memset.S
>> @@ -0,0 +1,101 @@
>> +#ifdef __riscv_vector
>
> I don't think musl is built with the V extension specified in -march on
> RISC-V platforms by default. Does this patch only benefit builds where
> "-march=rv64gcv" is manually specified in CFLAGS?
>
> Furthermore, having RVV available at compile time doesn't mean it's
> available at runtime. This effectively raises the baseline for RISC-V
> platforms from RV64GC (or even lower) to RV64GCV, and the latter isn't
> implied by the most widely adopted profile, RVA20.
>
> Best regards,
> Yao Zi
Hi, Yao
Thank you for your review. This patch currently only takes effect when
`-march=rv64gcv` is manually specified in CFLAGS. I also understand your
concern about enabling the vector implementation purely through
compile-time conditionals.
I am investigating a runtime detection and dispatch mechanism to select
the appropriate implementation based on actual hardware support. If I
make progress on this and verify it works as expected, I will update the
approach in a v2 patch.
Best regards,
Pincheng Wang
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
2025-09-26 0:31 ` Pincheng Wang
@ 2025-09-26 3:37 ` Markus Wichmann
2025-09-26 11:21 ` Pincheng Wang
0 siblings, 1 reply; 6+ messages in thread
From: Markus Wichmann @ 2025-09-26 3:37 UTC
To: musl; +Cc: Yao Zi
On Fri, Sep 26, 2025 at 08:31:53AM +0800, Pincheng Wang wrote:
> I am investigating a runtime detection and dispatch mechanism to select the
> appropriate implementation based on actual hardware support. If I make
> progress on this and verify it works as expected, I will update the approach
> in a v2 patch.
>
There seems to be a hwcap flag for ISA_V that might be usable. Not sure
if it is any help for memcpy() though, because memcpy() is a function
GCC can insert calls to at any point, including the dynamic linker
stages 1 and 2, and references to libc.hwcap are invalid there. The only
solution I see is to explicitly switch to the optimized version after it
is possible to do so and use the generic version for starters. But that
would be a bit of a larger change to the code base.
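The "generic first, switch later" idea can be sketched in C as follows. Everything here is hypothetical and for illustration only: memset_select, memset_dispatch and the stand-in routine are invented names, not musl internals, and the stand-in simply counts calls instead of using vector instructions. The pointer starts out null rather than pointing at the generic routine, with the intent that the early path does not depend on an initialized pointer; whether that is sufficient inside musl's dynamic linker stages is not verified here.

```c
#include <stddef.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* Counts calls into the stand-in, for demonstration only. */
static unsigned vector_calls;

/* Generic byte loop, usable before any CPU feature is known. */
static void *memset_generic(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	for (size_t i = 0; i < n; i++) s[i] = (unsigned char)c;
	return dest;
}

/* Hypothetical stand-in for the RVV routine; a real one would be the
 * assembly implementation. */
static void *memset_vector_standin(void *dest, int c, size_t n)
{
	vector_calls++;
	return memset_generic(dest, c, n);
}

/* Null until a selection has been made. */
static memset_fn memset_impl;

/* Hypothetical hook, to be called once the hwcap value is valid. */
void memset_select(unsigned long hwcap)
{
	if (hwcap & (1UL << ('V' - 'A')))
		memset_impl = memset_vector_standin;
}

/* Entry point: generic until memset_select() has picked something. */
void *memset_dispatch(void *dest, int c, size_t n)
{
	memset_fn f = memset_impl;
	return f ? f(dest, c, n) : memset_generic(dest, c, n);
}
```

The extra load and indirect call on every invocation is the price of this scheme, and it weighs most on exactly the small sizes where the cover letter already reports regressions.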
Ciao,
Markus
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension
2025-09-26 3:37 ` Markus Wichmann
@ 2025-09-26 11:21 ` Pincheng Wang
0 siblings, 0 replies; 6+ messages in thread
From: Pincheng Wang @ 2025-09-26 11:21 UTC
To: musl, Markus Wichmann; +Cc: Yao Zi
On 2025/9/26 11:37, Markus Wichmann wrote:
> On Fri, Sep 26, 2025 at 08:31:53AM +0800, Pincheng Wang wrote:
>> I am investigating a runtime detection and dispatch mechanism to select the
>> appropriate implementation based on actual hardware support. If I make
>> progress on this and verify it works as expected, I will update the approach
>> in a v2 patch.
>>
>
> There seems to be a hwcap flag for ISA_V that might be usable. Not sure
> if it is any help for memcpy() though, because memcpy() is a function
> GCC can insert calls to at any point, including the dynamic linker
> stages 1 and 2, and references to libc.hwcap are invalid there. The only
> solution I see is to explicitly switch to the optimized version after it
> is possible to do so and use the generic version for starters. But that
> would be a bit of a larger change to the code base.
>
> Ciao,
> Markus
Hi, Markus
Thanks for your suggestion! I do plan to use hwcap to detect the V
extension, similar to implementations on other platforms. Your point
about hwcap being unavailable during the linker's bootstrap stage is
indeed very helpful - I hadn't fully considered that. I'll look into
adopting your suggestion: using a generic implementation initially and
explicitly switching to the optimized version once hwcap becomes
available. However, I can't yet be certain about the exact
implementation details, as this will require some time for coding and
testing.
Moreover, since I haven't found any existing RISC-V examples of such
detection-and-dispatch mechanisms, and given that this approach would
indeed entail significant changes to the dynamic linker, I plan to send
the v2 patch as an RFC after I finish my current work.
Thank you again for your insightful comments and suggestions!
Best regards,
Pincheng Wang
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/