* [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation @ 2025-09-25 13:15 Pincheng Wang 2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang 0 siblings, 1 reply; 6+ messages in thread From: Pincheng Wang @ 2025-09-25 13:15 UTC (permalink / raw) To: musl; +Cc: pincheng.plct Hi all, This patch introduces a RISC-V Vector (RVV) optimized implementation of memset. Key points: - Use RVV instructions to fill memory in bulk, with a small-size head-tail fast path to reduce vsetvli overhead. - Fall back to a scalar head-tail implementation (like generic C implementation) when RVV is not available. - Reduce both instruction count and code size: memset.o shrinks by about 16.5% compared to the generic C build. Performance results on RVV-capable hardware show clear improvements: - On Spacemit X60: up to ~3.1x faster (256B), with consistent gains across medium and large sizes. - On XuanTie C908: up to ~2.1x faster (128B), with modest gains for larger sizes. For very small sizes (<8 Bytes), there can be regressions compared to the generic C version. A more aggresive fast path could remove these regressions, but at the cost of added code complexity. Feedback on this trade-off is welcome. The implementation was tested under QEMU with RVV enabled and on real hardware. Functional behavior matches the generic memset, with no changes to the public interface. Thanks, Pincheng Wang Pincheng Wang (1): riscv64: optimize memset implementation with vector extension src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++ 1 file changed, 101 insertions(+) create mode 100644 src/string/riscv64/memset.S -- 2.39.5 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension 2025-09-25 13:15 [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang @ 2025-09-25 13:15 ` Pincheng Wang 2025-09-25 15:30 ` Yao Zi 0 siblings, 1 reply; 6+ messages in thread From: Pincheng Wang @ 2025-09-25 13:15 UTC (permalink / raw) To: musl; +Cc: pincheng.plct Use head-tail filling strategy for small sizes and dynamic vsetvli approach for vector loops to reduce branch overhead. Add conditional compilation to fall back to scalar implementation when __riscv_vector is not available. Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn> --- src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++ 1 file changed, 101 insertions(+) create mode 100644 src/string/riscv64/memset.S diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S new file mode 100644 index 00000000..5fc6ee14 --- /dev/null +++ b/src/string/riscv64/memset.S @@ -0,0 +1,101 @@ +#ifdef __riscv_vector + + .text + .global memset +/* void *memset(void *s, int c, size_t n) + * a0 = s (dest), a1 = c (fill byte), a2 = n (size) + * Returns a0. + */ +memset: + mv t0, a0 /* running dst; keep a0 as return */ + beqz a2, .Ldone /* n == 0 → return */ + + li t3, 8 + bltu a2, t3, .Lsmall /* small-size fast path */ + + /* Broadcast fill byte once. */ + vsetvli t1, zero, e8, m8, ta, ma + vmv.v.x v0, a1 + +.Lbulk: + vsetvli t1, a2, e8, m8, ta, ma /* t1 = vl (bytes) */ + vse8.v v0, (t0) + add t0, t0, t1 + sub a2, a2, t1 + bnez a2, .Lbulk + j .Ldone + +/* Small-size fast path (< 8). + * Head-tail fills to minimize branches and avoid vsetvli overhead. + */ +.Lsmall: + /* Fill s[0], s[n-1] */ + sb a1, 0(t0) + add t2, t0, a2 + sb a1, -1(t2) + li t3, 2 + bleu a2, t3, .Ldone + + /* Fill s[1], s[2], s[n-2], s[n-3] */ + sb a1, 1(t0) + sb a1, 2(t0) + sb a1, -2(t2) + sb a1, -3(t2) + li t3, 6 + bleu a2, t3, .Ldone + + /* Fill s[3], s[n-4] */ + sb a1, 3(t0) + sb a1, -4(t2) + /* fallthrough for n <= 8 */ + +.Ldone: + ret +.size memset, .-memset + +#else /* !__riscv_vector */ + + .text + .global memset +/* Fallback scalar memset + * void *memset(void *s, int c, size_t n) + */ +memset: + mv t0, a0 /* running dst; keep a0 as return */ + beqz a2, .Ldone + + andi a1, a1, 0xff /* use low 8 bits only */ + + /* Head-tail strategy for small n */ + sb a1, 0(t0) /* s[0] */ + add t2, t0, a2 + sb a1, -1(t2) /* s[n-1] */ + li t3, 2 + bleu a2, t3, .Ldone + + sb a1, 1(t0) + sb a1, 2(t0) + sb a1, -2(t2) + sb a1, -3(t2) + li t3, 6 + bleu a2, t3, .Ldone + + sb a1, 3(t0) + sb a1, -4(t2) + li t3, 8 + bleu a2, t3, .Ldone + + /* Linear fill middle region [4, n-4) */ + addi t4, t0, 4 + addi t5, t2, -4 +.Lloop: + bgeu t4, t5, .Ldone + sb a1, 0(t4) + addi t4, t4, 1 + j .Lloop + +.Ldone: + ret +.size memset, .-memset + +#endif -- 2.39.5 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension 2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang @ 2025-09-25 15:30 ` Yao Zi 2025-09-26 0:31 ` Pincheng Wang 0 siblings, 1 reply; 6+ messages in thread From: Yao Zi @ 2025-09-25 15:30 UTC (permalink / raw) To: musl; +Cc: pincheng.plct On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote: > Use head-tail filling strategy for small sizes and dynamic vsetvli > approach for vector loops to reduce branch overhead. Add conditional > compilation to fall back to scalar implementation when __riscv_vector is > not available. > > Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn> > --- > src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 101 insertions(+) > create mode 100644 src/string/riscv64/memset.S > > diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S > new file mode 100644 > index 00000000..5fc6ee14 > --- /dev/null > +++ b/src/string/riscv64/memset.S > @@ -0,0 +1,101 @@ > +#ifdef __riscv_vector I don't think musl is built with V extension specified in march on RISC-V platforms by default. Does this patch only benefit builds that "-march=rv64gcv" is manually specified in CFLAGS? Furthermore, having RVV available at compilation-time doesn't mean it's available at runtime. This effectively raises the baseline for RISC-V platforms from RV64GC (or even lower) to RV64GCV, where the latter isn't implied by the mostly-adapted RVA20 profile. Best regards, Yao Zi ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension 2025-09-25 15:30 ` Yao Zi @ 2025-09-26 0:31 ` Pincheng Wang 2025-09-26 3:37 ` Markus Wichmann 0 siblings, 1 reply; 6+ messages in thread From: Pincheng Wang @ 2025-09-26 0:31 UTC (permalink / raw) To: musl, Yao Zi On 2025/9/25 23:30, Yao Zi wrote: > On Thu, Sep 25, 2025 at 09:15:57PM +0800, Pincheng Wang wrote: >> Use head-tail filling strategy for small sizes and dynamic vsetvli >> approach for vector loops to reduce branch overhead. Add conditional >> compilation to fall back to scalar implementation when __riscv_vector is >> not available. >> >> Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn> >> --- >> src/string/riscv64/memset.S | 101 ++++++++++++++++++++++++++++++++++++ >> 1 file changed, 101 insertions(+) >> create mode 100644 src/string/riscv64/memset.S >> >> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S >> new file mode 100644 >> index 00000000..5fc6ee14 >> --- /dev/null >> +++ b/src/string/riscv64/memset.S >> @@ -0,0 +1,101 @@ >> +#ifdef __riscv_vector > > I don't think musl is built with V extension specified in march on > RISC-V platforms by default. Does this patch only benefit builds that > "-march=rv64gcv" is manually specified in CFLAGS? > > Furthermore, having RVV available at compilation-time doesn't mean it's > available at runtime. This effectively raises the baseline for RISC-V > platforms from RV64GC (or even lower) to RV64GCV, where the latter isn't > implied by the mostly-adapted RVA20 profile. > > Best regards, > Yao Zi Hi, Yao Thank you for your review. This patch currently only takes effect when `-march=rv64gcv` is manually specified in CFLAGS. I also understand your concern about enabling the vector implementation purely through compile-time conditionals. I am investigating a runtime detection and dispatch mechanism to select the appropriate implementation based on actual hardware support. If I make progress on this and verify it works as expected, I will update the approach in a v2 patch. Best regards, Pincheng Wang ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension 2025-09-26 0:31 ` Pincheng Wang @ 2025-09-26 3:37 ` Markus Wichmann 2025-09-26 11:21 ` Pincheng Wang 0 siblings, 1 reply; 6+ messages in thread From: Markus Wichmann @ 2025-09-26 3:37 UTC (permalink / raw) To: musl; +Cc: Yao Zi Am Fri, Sep 26, 2025 at 08:31:53AM +0800 schrieb Pincheng Wang: > I am investigating a runtime detection and dispatch mechanism to select the > appropriate implementation based on actual hardware support. If I make > progress on this and verify it works as expected, I will update the approach > in a v2 patch. > There seems to be a hwcap flag for ISA_V that might be usable. Not sure if it is any help for memcpy() though, because memcpy() is a function GCC can insert calls to at any point, including the dynamic linker stages 1 and 2, and references to libc.hwcap are invalid there. The only solution I see is to explicitly switch to the optimized version after it is possible to do so and use the generic version for starters. But that would be a bit of a larger change to the code base. Ciao, Markus ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension 2025-09-26 3:37 ` Markus Wichmann @ 2025-09-26 11:21 ` Pincheng Wang 0 siblings, 0 replies; 6+ messages in thread From: Pincheng Wang @ 2025-09-26 11:21 UTC (permalink / raw) To: musl, Markus Wichmann; +Cc: Yao Zi On 2025/9/26 11:37, Markus Wichmann wrote: > Am Fri, Sep 26, 2025 at 08:31:53AM +0800 schrieb Pincheng Wang: >> I am investigating a runtime detection and dispatch mechanism to select the >> appropriate implementation based on actual hardware support. If I make >> progress on this and verify it works as expected, I will update the approach >> in a v2 patch. >> > > There seems to be a hwcap flag for ISA_V that might be usable. Not sure > if it is any help for memcpy() though, because memcpy() is a function > GCC can insert calls to at any point, including the dynamic linker > stages 1 and 2, and references to libc.hwcap are invalid there. The only > solution I see is to explicitly switch to the optimized version after it > is possible to do so and use the generic version for starters. But that > would be a bit of a larger change to the code base. > > Ciao, > Markus Hi, Markus Thanks for your suggestion! I do plan to use hwcap to detect the V extension, similar to implementations on other platforms. Your point about hwcap being unavailable during the linker's bootstrap stage is indeed very helpful - I hadn't fully considered that. I'll look into adopting your suggestion: using a generic implementation initially and explicitly switching to the optimized version once hwcap becomes available. However, I can't yet be certain about the exact implementation details, as this will require some time for coding and testing. Moreover, since I haven't found any existing RISC-V examples of such detection-and-dispatch mechanisms, and given that this approach would indeed entail significant changes to the dynamic linker, I plan to send the v2 patch as an RFC after I finish my current work. Thank you again for your insightful comments and suggestions! Best regards, Pincheng Wang ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-09-26 11:22 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-09-25 13:15 [musl] [PATCH 0/1] riscv64: Add RVV optimized memset implementation Pincheng Wang 2025-09-25 13:15 ` [musl] [PATCH 1/1] riscv64: optimize memset implementation with vector extension Pincheng Wang 2025-09-25 15:30 ` Yao Zi 2025-09-26 0:31 ` Pincheng Wang 2025-09-26 3:37 ` Markus Wichmann 2025-09-26 11:21 ` Pincheng Wang
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).