* [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove
@ 2023-06-07 10:07 zhangfei
2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
To: dalias, musl; +Cc: zhangfei
From: zhangfei <zhangfei@nj.iscas.ac.cn>
Hi,
Currently, the RISC-V architecture in the kernel source tree uses assembly
implementations of memset, memcpy, and memmove, as shown in the links below:
[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S
I have adapted them into a form that compiles in musl. Since aarch64 and x86
already have assembly implementations of these functions in musl, I hope
these patches can be integrated as well.
memset.S follows the handling in musl/src/string/memset.c for small sizes
(fewer than 8 bytes there), replacing plain byte stores with a head-and-tail
fill that uses minimal branching.
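As a rough C sketch of that head/tail technique (hypothetical helper name; the thresholds follow the tail handler in the memset.S patch), the region is filled from both ends so that every store before a length check is already known to be in bounds from the previous check:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: fill a short region (n < 16) from both
 * ends with minimal branching. Each conditional guarantees that all
 * subsequently used offsets lie inside the destination region. */
static void set_head_tail(unsigned char *s, unsigned char c, size_t n)
{
	unsigned char *e = s + n; /* one past the end */
	if (!n) return;
	s[0] = c; e[-1] = c;
	if (n <= 2) return;
	s[1] = c; s[2] = c; e[-2] = c; e[-3] = c;
	if (n <= 6) return;
	s[3] = c; e[-4] = c;
	if (n <= 8) return;
	s[4] = c; s[5] = c; e[-5] = c;
	if (n <= 11) return;
	s[6] = c; e[-6] = c; e[-7] = c;
	if (n <= 14) return;
	s[7] = c;
}
```

The assembly encodes the same ladder with li/bgeu pairs, so a short fill costs at most a handful of branches instead of one branch per byte.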
The original memcpy.S in the kernel falls back to a byte-wise copy when src
and dst are not co-aligned, which is not efficient enough. The patches
linked below were therefore used to optimize the kernel's memcpy.S:
[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/
memmove.S required few modifications; it was only made independent of the
kernel's header files so that it can be compiled on its own in musl.
The testing platform is a RISC-V SiFive U74. I used the benchmark code
linked below for performance testing:
[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/
I compared the performance of the musl C implementations against the
assembly implementations; the test results are as follows:
memset.c in musl:
---------------------
Random memset (bytes/ns):
memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25
Medium memset (bytes/ns):
memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72
Large memset (bytes/ns):
memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39
memset.S:
---------------------
Random memset (bytes/ns):
memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31
Medium memset (bytes/ns):
memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67
Large memset (bytes/ns):
memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39
memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18
Aligned medium memcpy (bytes/ns):
memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19
Unaligned medium memcpy (bytes/ns):
memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91
Large memcpy (bytes/ns):
memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56
memcpy.S:
---------------------
Random memcpy (bytes/ns):
memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21
Aligned medium memcpy (bytes/ns):
memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27
Unaligned medium memcpy (bytes/ns):
memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16
Large memcpy (bytes/ns):
memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75
memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20
Unaligned backwards memmove (bytes/ns):
memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24
memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81
Unaligned backwards memmove (bytes/ns):
memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83
As the results show, the assembly implementations of memset, memcpy, and
memmove offer a clear performance improvement over the C implementations in
musl. Please review the code.
Thanks,
Zhang Fei
zhangfei (3):
RISC-V: Optimize memset
RISC-V: Optimize memcpy
RISC-V: Optimize memmove
src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
src/string/riscv64/memcpy.S | 159 ++++++++++++++++++++++++++++++++++++
src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
3 files changed, 610 insertions(+)
create mode 100644 src/string/riscv64/memset.S
create mode 100644 src/string/riscv64/memcpy.S
create mode 100644 src/string/riscv64/memmove.S
^ permalink raw reply [flat|nested] 5+ messages in thread
* [musl] [PATCH 1/3] RISC-V: Optimize memset
2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
@ 2023-06-07 10:07 ` zhangfei
2023-06-07 12:57 ` Rich Felker
2023-06-07 10:07 ` [musl] [PATCH 2/3] RISC-V: Optimize memcpy zhangfei
2023-06-07 10:07 ` [musl] [PATCH 3/3] RISC-V: Optimize memmove zhangfei
2 siblings, 1 reply; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
To: dalias, musl; +Cc: zhangfei
From: zhangfei <zhangfei@nj.iscas.ac.cn>
This code is based on linux/arch/riscv/lib/memset.S, with the macro definitions removed and
the code adapted to build for RISCV64 in musl.
Byte stores are used when the fill is shorter than 16 bytes and for the tail left over after
the unrolled loop. Following musl/src/string/memset.c, the head and tail are filled with
minimal branching.
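For the word-fill path, the assembly first replicates the fill byte across an XLEN-wide register (the andi/slli/or sequence at label 2 in the diff below). A C sketch of that splat, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration of the fill-value splat: replicate the low
 * byte of c across a 64-bit word so that each sd in the unrolled loop
 * stores eight fill bytes at once. */
static uint64_t splat_byte(uint64_t c)
{
	c &= 0xff;     /* andi a1, a1, 0xff */
	c |= c << 8;   /* 2 copies of the byte */
	c |= c << 16;  /* 4 copies */
	c |= c << 32;  /* 8 copies */
	return c;
}
```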
Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
1 file changed, 136 insertions(+)
create mode 100644 src/string/riscv64/memset.S
diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 0000000..f8663d7
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2013 Regents of the University of California
+ */
+
+#define SZREG 8
+#define REG_S sd
+
+.global memset
+.type memset,@function
+memset:
+ move t0, a0 /* Preserve return value */
+
+ /* Defer to byte-oriented fill for small sizes */
+ sltiu a3, a2, 16
+ bnez a3, 4f
+
+ /*
+ * Round to nearest XLEN-aligned address
+ * greater than or equal to start address
+ */
+ addi a3, t0, SZREG-1
+ andi a3, a3, ~(SZREG-1)
+ beq a3, t0, 2f /* Skip if already aligned */
+ /* Handle initial misalignment */
+ sub a4, a3, t0
+1:
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 1b
+ sub a2, a2, a4 /* Update count */
+
+2:
+ andi a1, a1, 0xff
+ slli a3, a1, 8
+ or a1, a3, a1
+ slli a3, a1, 16
+ or a1, a3, a1
+ slli a3, a1, 32
+ or a1, a3, a1
+
+ /* Calculate end address */
+ andi a4, a2, ~(SZREG-1)
+ add a3, t0, a4
+
+ andi a4, a4, 31*SZREG /* Calculate remainder */
+ beqz a4, 3f /* Shortcut if no remainder */
+ neg a4, a4
+ addi a4, a4, 32*SZREG /* Calculate initial offset */
+
+ /* Adjust start address with offset */
+ sub t0, t0, a4
+
+ /* Jump into loop body */
+ /* Assumes 32-bit instruction lengths */
+ la a5, 3f
+ srli a4, a4, 1
+ add a5, a5, a4
+ jr a5
+3:
+ REG_S a1, 0(t0)
+ REG_S a1, SZREG(t0)
+ REG_S a1, 2*SZREG(t0)
+ REG_S a1, 3*SZREG(t0)
+ REG_S a1, 4*SZREG(t0)
+ REG_S a1, 5*SZREG(t0)
+ REG_S a1, 6*SZREG(t0)
+ REG_S a1, 7*SZREG(t0)
+ REG_S a1, 8*SZREG(t0)
+ REG_S a1, 9*SZREG(t0)
+ REG_S a1, 10*SZREG(t0)
+ REG_S a1, 11*SZREG(t0)
+ REG_S a1, 12*SZREG(t0)
+ REG_S a1, 13*SZREG(t0)
+ REG_S a1, 14*SZREG(t0)
+ REG_S a1, 15*SZREG(t0)
+ REG_S a1, 16*SZREG(t0)
+ REG_S a1, 17*SZREG(t0)
+ REG_S a1, 18*SZREG(t0)
+ REG_S a1, 19*SZREG(t0)
+ REG_S a1, 20*SZREG(t0)
+ REG_S a1, 21*SZREG(t0)
+ REG_S a1, 22*SZREG(t0)
+ REG_S a1, 23*SZREG(t0)
+ REG_S a1, 24*SZREG(t0)
+ REG_S a1, 25*SZREG(t0)
+ REG_S a1, 26*SZREG(t0)
+ REG_S a1, 27*SZREG(t0)
+ REG_S a1, 28*SZREG(t0)
+ REG_S a1, 29*SZREG(t0)
+ REG_S a1, 30*SZREG(t0)
+ REG_S a1, 31*SZREG(t0)
+ addi t0, t0, 32*SZREG
+ bltu t0, a3, 3b
+ andi a2, a2, SZREG-1 /* Update count */
+
+4:
+ /* Handle trailing misalignment */
+ beqz a2, 6f
+ add a3, t0, a2
+5:
+ /* Fill head and tail with minimal branching. Each
+ * conditional ensures that all the subsequently used
+ * offsets are well-defined and in the dest region. */
+ sb a1, 0(t0)
+ sb a1, -1(a3)
+ li a4, 2
+ bgeu a4, a2, 6f
+
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(a3)
+ sb a1, -3(a3)
+ li a4, 6
+ bgeu a4, a2, 6f
+
+ sb a1, 3(t0)
+ sb a1, -4(a3)
+ li a4, 8
+ bgeu a4, a2, 6f
+
+ sb a1, 4(t0)
+ sb a1, 5(t0)
+ sb a1, -5(a3)
+ li a4, 11
+ bgeu a4, a2, 6f
+
+ sb a1, 6(t0)
+ sb a1, -6(a3)
+ sb a1, -7(a3)
+ li a4, 14
+ bgeu a4, a2, 6f
+
+ sb a1, 7(t0)
+6:
+ ret
--
2.34.1
* [musl] [PATCH 2/3] RISC-V: Optimize memcpy
2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
@ 2023-06-07 10:07 ` zhangfei
2023-06-07 10:07 ` [musl] [PATCH 3/3] RISC-V: Optimize memmove zhangfei
2 siblings, 0 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
To: dalias, musl; +Cc: zhangfei
From: zhangfei <zhangfei@nj.iscas.ac.cn>
This code is based on linux/arch/riscv/lib/memcpy.S, with the macro definitions removed so
that it builds for RISCV64 in musl.
The original implementation in the kernel falls back to a byte-wise copy when src and dst
are not co-aligned, which is not efficient enough. The patch linked below was therefore
used to modify this section:
https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
That patch optimizes memcpy for the misaligned case: when src and dst are not co-aligned,
it loads two adjacent words from src and uses shifts to assemble a full machine word for
each aligned store to dst.
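A rough C sketch of that shift-and-assemble technique (hypothetical helper; assumes little-endian RV64 semantics and a genuinely misaligned src, as in the .Lmisaligned_word_copy path below):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration (little-endian, 64-bit) of the misaligned
 * word copy: dst is word-aligned, src is not (off != 0 is required,
 * matching the asm path, otherwise shl would be 64). Whole words are
 * loaded from the rounded-down src address and each output word is
 * stitched together from two neighbouring input words. */
static void copy_misaligned_words(uint64_t *dst, const unsigned char *src,
                                  size_t nwords)
{
	size_t off = (uintptr_t)src & 7;                   /* 1..7 */
	unsigned shr = 8 * (unsigned)off;                  /* t3 in the asm */
	unsigned shl = 64 - shr;                           /* t4 (mod 64)   */
	const uint64_t *s = (const uint64_t *)(src - off); /* align down    */
	uint64_t lo = s[0];
	for (size_t i = 0; i < nwords; i++) {
		uint64_t hi = s[i + 1];
		dst[i] = (lo >> shr) | (hi << shl); /* stitch two words */
		lo = hi;                            /* reuse last load  */
	}
}
```

Each iteration performs one aligned load and one aligned store, reusing the previous load, instead of eight byte loads and stores.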
Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
src/string/riscv64/memcpy.S | 159 ++++++++++++++++++++++++++++++++++++
1 file changed, 159 insertions(+)
create mode 100644 src/string/riscv64/memcpy.S
diff --git a/src/string/riscv64/memcpy.S b/src/string/riscv64/memcpy.S
new file mode 100644
index 0000000..ee59924
--- /dev/null
+++ b/src/string/riscv64/memcpy.S
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2013 Regents of the University of California
+ */
+
+#define SZREG 8
+#define REG_S sd
+#define REG_L ld
+
+.global memcpy
+.type memcpy,@function
+memcpy:
+ /* Save for return value */
+ mv t6, a0
+
+ /*
+ * Register allocation for code below:
+ * a0 - start of uncopied dst
+ * a1 - start of uncopied src
+ * t0 - end of uncopied dst
+ */
+ add t0, a0, a2
+
+ /*
+ * Use bytewise copy if too small.
+ *
+ * This threshold must be at least 2*SZREG to ensure at least one
+ * wordwise copy is performed. It is chosen to be 16 because it will
+ * save at least 7 iterations of bytewise copy, which pays off the
+ * fixed overhead.
+ */
+ li a3, 16
+ bltu a2, a3, .Lbyte_copy_tail
+
+ /*
+ * Bytewise copy first to align a0 to word boundary.
+ */
+ addi a2, a0, SZREG-1
+ andi a2, a2, ~(SZREG-1)
+ beq a0, a2, 2f
+1:
+ lb a5, 0(a1)
+ addi a1, a1, 1
+ sb a5, 0(a0)
+ addi a0, a0, 1
+ bne a0, a2, 1b
+2:
+
+ /*
+ * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+ * aligned word-wise copy. Otherwise we need to perform misaligned
+ * word-wise copy.
+ */
+ andi a3, a1, SZREG-1
+ bnez a3, .Lmisaligned_word_copy
+
+ /* Unrolled wordwise copy */
+ addi t0, t0, -(16*SZREG-1)
+ bgeu a0, t0, 2f
+1:
+ REG_L a2, 0(a1)
+ REG_L a3, SZREG(a1)
+ REG_L a4, 2*SZREG(a1)
+ REG_L a5, 3*SZREG(a1)
+ REG_L a6, 4*SZREG(a1)
+ REG_L a7, 5*SZREG(a1)
+ REG_L t1, 6*SZREG(a1)
+ REG_L t2, 7*SZREG(a1)
+ REG_L t3, 8*SZREG(a1)
+ REG_L t4, 9*SZREG(a1)
+ REG_L t5, 10*SZREG(a1)
+ REG_S a2, 0(a0)
+ REG_S a3, SZREG(a0)
+ REG_S a4, 2*SZREG(a0)
+ REG_S a5, 3*SZREG(a0)
+ REG_S a6, 4*SZREG(a0)
+ REG_S a7, 5*SZREG(a0)
+ REG_S t1, 6*SZREG(a0)
+ REG_S t2, 7*SZREG(a0)
+ REG_S t3, 8*SZREG(a0)
+ REG_S t4, 9*SZREG(a0)
+ REG_S t5, 10*SZREG(a0)
+ REG_L a2, 11*SZREG(a1)
+ REG_L a3, 12*SZREG(a1)
+ REG_L a4, 13*SZREG(a1)
+ REG_L a5, 14*SZREG(a1)
+ REG_L a6, 15*SZREG(a1)
+ addi a1, a1, 16*SZREG
+ REG_S a2, 11*SZREG(a0)
+ REG_S a3, 12*SZREG(a0)
+ REG_S a4, 13*SZREG(a0)
+ REG_S a5, 14*SZREG(a0)
+ REG_S a6, 15*SZREG(a0)
+ addi a0, a0, 16*SZREG
+ bltu a0, t0, 1b
+2:
+ /* Post-loop increment by 16*SZREG-1 and pre-loop decrement by SZREG-1 */
+ addi t0, t0, 15*SZREG
+
+ /* Wordwise copy */
+ bgeu a0, t0, 2f
+1:
+ REG_L a5, 0(a1)
+ addi a1, a1, SZREG
+ REG_S a5, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b
+2:
+ addi t0, t0, SZREG-1
+
+.Lbyte_copy_tail:
+ /*
+ * Bytewise copy anything left.
+ */
+ beq a0, t0, 2f
+1:
+ lb a5, 0(a1)
+ addi a1, a1, 1
+ sb a5, 0(a0)
+ addi a0, a0, 1
+ bne a0, t0, 1b
+2:
+
+ mv a0, t6
+ ret
+
+.Lmisaligned_word_copy:
+ /*
+ * Misaligned word-wise copy.
+ * For misaligned copy we still perform word-wise copy, but we need to
+ * use the value fetched from the previous iteration and do some shifts.
+ * This is safe because we wouldn't access more words than necessary.
+ */
+
+ /* Calculate shifts */
+ slli t3, a3, 3
+ sub t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+ /* Load the initial value and align a1 */
+ andi a1, a1, ~(SZREG-1)
+ REG_L a5, 0(a1)
+
+ addi t0, t0, -(SZREG-1)
+ /* At least one iteration will be executed here, no check */
+1:
+ srl a4, a5, t3
+ REG_L a5, SZREG(a1)
+ addi a1, a1, SZREG
+ sll a2, a5, t4
+ or a2, a2, a4
+ REG_S a2, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b
+
+ /* Update pointers to correct value */
+ addi t0, t0, SZREG-1
+ add a1, a1, a3
+
+ j .Lbyte_copy_tail
--
2.34.1
* [musl] [PATCH 3/3] RISC-V: Optimize memmove
2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
2023-06-07 10:07 ` [musl] [PATCH 2/3] RISC-V: Optimize memcpy zhangfei
@ 2023-06-07 10:07 ` zhangfei
2 siblings, 0 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
To: dalias, musl; +Cc: zhangfei
From: zhangfei <zhangfei@nj.iscas.ac.cn>
This code is based on linux/arch/riscv/lib/memmove.S, with the macro definitions removed so
that it builds for RISCV64 in musl.
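The core structural decision in memmove.S is the copy direction: forward when dest precedes src, reverse otherwise, so overlapping regions are never clobbered. A byte-wise C sketch of that dispatch (hypothetical name; the assembly layers aligned and misaligned word-copy paths on top of the same skeleton):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte-wise illustration of the dispatch at the top of
 * memmove.S: copy in reverse when src precedes dest (bltu a1, a0 in
 * the asm), forward otherwise, so overlap is handled correctly.
 * (The raw pointer comparison is what matters for the overlapping
 * case this guards against; strict C leaves it unspecified for
 * pointers into unrelated objects.) */
static void *memmove_sketch(void *destv, const void *srcv, size_t n)
{
	unsigned char *d = destv;
	const unsigned char *s = srcv;
	if (d == s || !n) return destv;
	if (s < d) {                         /* overlap: go backwards */
		while (n--) d[n] = s[n];
	} else {                             /* go forwards */
		for (size_t i = 0; i < n; i++) d[i] = s[i];
	}
	return destv;
}
```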
Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
1 file changed, 315 insertions(+)
create mode 100644 src/string/riscv64/memmove.S
diff --git a/src/string/riscv64/memmove.S b/src/string/riscv64/memmove.S
new file mode 100644
index 0000000..41b84e3
--- /dev/null
+++ b/src/string/riscv64/memmove.S
@@ -0,0 +1,315 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
+ */
+
+#define SZREG 8
+#define REG_S sd
+#define REG_L ld
+
+.global memmove
+.type memmove,@function
+memmove:
+ /*
+ * Returns
+ * a0 - dest
+ *
+ * Parameters
+ * a0 - Inclusive first byte of dest
+ * a1 - Inclusive first byte of src
+ * a2 - Length of copy n
+ *
+ * Because the return matches the parameter register a0,
+ * we will not clobber or modify that register.
+ *
+ * Note: This currently only works on little-endian.
+ * To port to big-endian, reverse the direction of shifts
+ * in the 2 misaligned fixup copy loops.
+ */
+
+ /* Return if nothing to do */
+ beq a0, a1, return_from_memmove
+ beqz a2, return_from_memmove
+
+ /*
+ * Register Uses
+ * Forward Copy: a1 - Index counter of src
+ * Reverse Copy: a4 - Index counter of src
+ * Forward Copy: t3 - Index counter of dest
+ * Reverse Copy: t4 - Index counter of dest
+ * Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest
+ * Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest
+ * Both Copy Modes: t0 - Link / Temporary for load-store
+ * Both Copy Modes: t1 - Temporary for load-store
+ * Both Copy Modes: t2 - Temporary for load-store
+ * Both Copy Modes: a5 - dest to src alignment offset
+ * Both Copy Modes: a6 - Shift amount
+ * Both Copy Modes: a7 - Inverse Shift amount
+ * Both Copy Modes: a2 - Alternate breakpoint for unrolled loops
+ */
+
+ /*
+ * Solve for some register values now.
+ * Byte copy does not need t5 or t6.
+ */
+ mv t3, a0
+ add t4, a0, a2
+ add a4, a1, a2
+
+ /*
+ * Byte copy if copying less than (2 * SZREG) bytes. This can
+ * cause problems with the bulk copy implementation and is
+ * small enough not to bother.
+ */
+ andi t0, a2, -(2 * SZREG)
+ beqz t0, byte_copy
+
+ /*
+ * Now solve for t5 and t6.
+ */
+ andi t5, t3, -SZREG
+ andi t6, t4, -SZREG
+ /*
+ * If dest(Register t3) rounded down to the nearest naturally
+ * aligned SZREG address, does not equal dest, then add SZREG
+ * to find the low-bound of SZREG alignment in the dest memory
+ * region. Note that this could overshoot the dest memory
+ * region if n is less than SZREG. This is one reason why
+ * we always byte copy if n is less than SZREG.
+ * Otherwise, dest is already naturally aligned to SZREG.
+ */
+ beq t5, t3, 1f
+ addi t5, t5, SZREG
+ 1:
+
+ /*
+ * If the dest and src are co-aligned to SZREG, then there is
+ * no need for the full rigmarole of a full misaligned fixup copy.
+ * Instead, do a simpler co-aligned copy.
+ */
+ xor t0, a0, a1
+ andi t1, t0, (SZREG - 1)
+ beqz t1, coaligned_copy
+ /* Fall through to misaligned fixup copy */
+
+misaligned_fixup_copy:
+ bltu a1, a0, misaligned_fixup_copy_reverse
+
+misaligned_fixup_copy_forward:
+ jal t0, byte_copy_until_aligned_forward
+
+ andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */
+ slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+ sub a5, a1, t3 /* Find the difference between src and dest */
+ andi a1, a1, -SZREG /* Align the src pointer */
+ addi a2, t6, SZREG /* The other breakpoint for the unrolled loop */
+
+ /*
+ * Compute The Inverse Shift
+ * a7 = XLEN - a6 = XLEN + -a6
+ * 2s complement negation to find the negative: -a6 = ~a6 + 1
+ * Add that to XLEN. XLEN = SZREG * 8.
+ */
+ not a7, a6
+ addi a7, a7, (SZREG * 8 + 1)
+
+ /*
+ * Fix Misalignment Copy Loop - Forward
+ * load_val0 = load_ptr[0];
+ * do {
+ * load_val1 = load_ptr[1];
+ * store_ptr += 2;
+ * store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7});
+ *
+ * if (store_ptr == {a2})
+ * break;
+ *
+ * load_val0 = load_ptr[2];
+ * load_ptr += 2;
+ * store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7});
+ *
+ * } while (store_ptr != store_ptr_end);
+ * store_ptr = store_ptr_end;
+ */
+
+ REG_L t0, (0 * SZREG)(a1)
+ 1:
+ REG_L t1, (1 * SZREG)(a1)
+ addi t3, t3, (2 * SZREG)
+ srl t0, t0, a6
+ sll t2, t1, a7
+ or t2, t0, t2
+ REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3)
+
+ beq t3, a2, 2f
+
+ REG_L t0, (2 * SZREG)(a1)
+ addi a1, a1, (2 * SZREG)
+ srl t1, t1, a6
+ sll t2, t0, a7
+ or t2, t1, t2
+ REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3)
+
+ bne t3, t6, 1b
+ 2:
+ mv t3, t6 /* Fix the dest pointer in case the loop was broken */
+
+ add a1, t3, a5 /* Restore the src pointer */
+ j byte_copy_forward /* Copy any remaining bytes */
+
+misaligned_fixup_copy_reverse:
+ jal t0, byte_copy_until_aligned_reverse
+
+ andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */
+ slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+ sub a5, a4, t4 /* Find the difference between src and dest */
+ andi a4, a4, -SZREG /* Align the src pointer */
+ addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop */
+
+ /*
+ * Compute The Inverse Shift
+ * a7 = XLEN - a6 = XLEN + -a6
+ * 2s complement negation to find the negative: -a6 = ~a6 + 1
+ * Add that to XLEN. XLEN = SZREG * 8.
+ */
+ not a7, a6
+ addi a7, a7, (SZREG * 8 + 1)
+
+ /*
+ * Fix Misalignment Copy Loop - Reverse
+ * load_val1 = load_ptr[0];
+ * do {
+ * load_val0 = load_ptr[-1];
+ * store_ptr -= 2;
+ * store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7});
+ *
+ * if (store_ptr == {a2})
+ * break;
+ *
+ * load_val1 = load_ptr[-2];
+ * load_ptr -= 2;
+ * store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7});
+ *
+ * } while (store_ptr != store_ptr_end);
+ * store_ptr = store_ptr_end;
+ */
+
+ REG_L t1, ( 0 * SZREG)(a4)
+ 1:
+ REG_L t0, (-1 * SZREG)(a4)
+ addi t4, t4, (-2 * SZREG)
+ sll t1, t1, a7
+ srl t2, t0, a6
+ or t2, t1, t2
+ REG_S t2, ( 1 * SZREG)(t4)
+
+ beq t4, a2, 2f
+
+ REG_L t1, (-2 * SZREG)(a4)
+ addi a4, a4, (-2 * SZREG)
+ sll t0, t0, a7
+ srl t2, t1, a6
+ or t2, t0, t2
+ REG_S t2, ( 0 * SZREG)(t4)
+
+ bne t4, t5, 1b
+ 2:
+ mv t4, t5 /* Fix the dest pointer in case the loop was broken */
+
+ add a4, t4, a5 /* Restore the src pointer */
+ j byte_copy_reverse /* Copy any remaining bytes */
+
+/*
+ * Simple copy loops for SZREG co-aligned memory locations.
+ * These also make calls to do byte copies for any unaligned
+ * data at their terminations.
+ */
+coaligned_copy:
+ bltu a1, a0, coaligned_copy_reverse
+
+coaligned_copy_forward:
+ jal t0, byte_copy_until_aligned_forward
+
+ 1:
+ REG_L t1, ( 0 * SZREG)(a1)
+ addi a1, a1, SZREG
+ addi t3, t3, SZREG
+ REG_S t1, (-1 * SZREG)(t3)
+ bne t3, t6, 1b
+
+ j byte_copy_forward /* Copy any remaining bytes */
+
+coaligned_copy_reverse:
+ jal t0, byte_copy_until_aligned_reverse
+
+ 1:
+ REG_L t1, (-1 * SZREG)(a4)
+ addi a4, a4, -SZREG
+ addi t4, t4, -SZREG
+ REG_S t1, ( 0 * SZREG)(t4)
+ bne t4, t5, 1b
+
+ j byte_copy_reverse /* Copy any remaining bytes */
+
+/*
+ * These are basically sub-functions within the function. They
+ * are used to byte copy until the dest pointer is in alignment.
+ * At which point, a bulk copy method can be used by the
+ * calling code. These work on the same registers as the bulk
+ * copy loops. Therefore, the register values can be picked
+ * up from where they were left and we avoid code duplication
+ * without any overhead except the call in and return jumps.
+ */
+byte_copy_until_aligned_forward:
+ beq t3, t5, 2f
+ 1:
+ lb t1, 0(a1)
+ addi a1, a1, 1
+ addi t3, t3, 1
+ sb t1, -1(t3)
+ bne t3, t5, 1b
+ 2:
+ jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+byte_copy_until_aligned_reverse:
+ beq t4, t6, 2f
+ 1:
+ lb t1, -1(a4)
+ addi a4, a4, -1
+ addi t4, t4, -1
+ sb t1, 0(t4)
+ bne t4, t6, 1b
+ 2:
+ jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+/*
+ * Simple byte copy loops.
+ * These will byte copy until they reach the end of data to copy.
+ * At that point, they will call to return from memmove.
+ */
+byte_copy:
+ bltu a1, a0, byte_copy_reverse
+
+byte_copy_forward:
+ beq t3, t4, 2f
+ 1:
+ lb t1, 0(a1)
+ addi a1, a1, 1
+ addi t3, t3, 1
+ sb t1, -1(t3)
+ bne t3, t4, 1b
+ 2:
+ ret
+
+byte_copy_reverse:
+ beq t4, t3, 2f
+ 1:
+ lb t1, -1(a4)
+ addi a4, a4, -1
+ addi t4, t4, -1
+ sb t1, 0(t4)
+ bne t4, t3, 1b
+ 2:
+
+return_from_memmove:
+ ret
--
2.34.1
* Re: [musl] [PATCH 1/3] RISC-V: Optimize memset
2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
@ 2023-06-07 12:57 ` Rich Felker
0 siblings, 0 replies; 5+ messages in thread
From: Rich Felker @ 2023-06-07 12:57 UTC (permalink / raw)
To: zhangfei; +Cc: musl, zhangfei
On Wed, Jun 07, 2023 at 06:07:08PM +0800, zhangfei wrote:
> From: zhangfei <zhangfei@nj.iscas.ac.cn>
>
> This code is based on linux/arch/riscv/lib/memset.S. Removed macro definition and modified
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> to support RISCV64.
> When the amount of data in the source code is less than 16 bytes or after loop tail
> processing, byte storage is used. Here we refer to musl/src/string/memset.c, and modify it
> to fill head and tail with minimal branching.
>
> Signed-off-by: Zhang Fei<zhangfei@nj.iscas.ac.cn>
> ---
> src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 136 insertions(+)
> create mode 100644 src/string/riscv64/memset.S
>
> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
> new file mode 100644
> index 0000000..f8663d7
> --- /dev/null
> +++ b/src/string/riscv64/memset.S
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
^^^^^^^^^^^^
This completely precludes any consideration for inclusion. Please do
not send license-incompatible code to the mailing list. Not only can
we not use it, but putting it in front of people actually working on
code suitable for musl makes us work extra hard to avoid taint.
You're free to link it into your own products (assuming you're
honoring your obligations under the GPL...), and doing so will get you
pretty much the entire benefit of having had this in libc.
Rich
end of thread, other threads:[~2023-06-07 12:57 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
2023-06-07 12:57 ` Rich Felker
2023-06-07 10:07 ` [musl] [PATCH 2/3] RISC-V: Optimize memcpy zhangfei
2023-06-07 10:07 ` [musl] [PATCH 3/3] RISC-V: Optimize memmove zhangfei
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).