mailing list of musl libc
* [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove
@ 2023-06-07 10:07 zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
  To: dalias, musl; +Cc: zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

Hi,

Currently, the RISC-V architecture in the Linux kernel source tree uses assembly
implementations of memset, memcpy, and memmove, as shown in the links below:

[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S

I have modified them into a form that can be built in musl. Since musl already has
assembly implementations of these functions for aarch64 and x86, I hope these patches
can be integrated into musl as well.

For small sizes, memset.S borrows the byte-store handling of musl/src/string/memset.c,
filling the head and tail with minimal branching; a rough C sketch of this pattern is
shown below.
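
The following C sketch is an illustration only and not part of the patches (the helper
name fill_head_tail is mine); the patch expresses this pattern in RISC-V assembly:

    #include <stddef.h>

    /* Branch-minimized head/tail byte fill, modeled on
     * musl/src/string/memset.c. Once a size check has passed, every
     * store below it is known to land inside the destination region. */
    static void fill_head_tail(unsigned char *s, int c, size_t n)
    {
        if (!n) return;
        s[0] = c;
        s[n-1] = c;
        if (n <= 2) return;
        s[1] = c; s[2] = c;
        s[n-2] = c; s[n-3] = c;
        if (n <= 6) return;
        s[3] = c;
        s[n-4] = c;
        if (n <= 8) return;
        /* ...larger sizes continue with aligned word-sized stores */
    }

The assembly in patch 1 extends this pattern with a few more stores so that any length
below 16 bytes is covered without a loop.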

The original memcpy.S in the kernel falls back to a byte-wise copy when src and dst are
not co-aligned, which is not efficient. Therefore, the patches linked below were used to
optimize the kernel's memcpy.S:

[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/

memmove.S required only minor modifications: it was made independent of the kernel's
header files so that it can be built standalone in musl.

The testing platform was a RISC-V SiFive U74. I used the benchmark code linked below
for performance testing.

[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/

Comparing the performance of musl's C implementations with the assembly implementations,
the test results are as follows:

memset.c in musl:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
           memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72
Large memset (bytes/ns):
           memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39

memset.S:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31

Medium memset (bytes/ns):
           memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67
Large memset (bytes/ns):
           memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39


memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91

Large memcpy (bytes/ns):
           memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56

memcpy.S:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16

Large memcpy (bytes/ns):
           memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75


memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24

memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83

As the results show, the assembly implementations of memset, memcpy, and memmove (using
only base instructions) deliver a clear performance improvement over the C
implementations in musl. Please review the code.

Thanks,
Zhang Fei

zhangfei (3):
  RISC-V: Optimize memset
  RISC-V: Optimize memcpy
  RISC-V: Optimize memmove

 src/string/riscv64/memset.S  | 136 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memcpy.S  | 159 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+)
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memcpy.S
 create mode 100644 src/string/riscv64/memmove.S



* [musl] [PATCH 1/3] RISC-V: Optimize memset
  2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
@ 2023-06-07 10:07 ` zhangfei
  2023-06-07 12:57   ` Rich Felker
  2023-06-07 10:07 ` [musl] [PATCH 2/3] RISC-V: Optimize memcpy zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 3/3] RISC-V: Optimize memmove zhangfei
  2 siblings, 1 reply; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
  To: dalias, musl; +Cc: zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

This code is based on linux/arch/riscv/lib/memset.S, with the kernel macro definitions
removed and the code adapted for RISCV64.
When the amount of data is less than 16 bytes, or for the tail left over after the main
loop, bytes are stored directly. Following musl/src/string/memset.c, the head and tail
are filled with minimal branching.

Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
 src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 src/string/riscv64/memset.S

diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 0000000..f8663d7
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2013 Regents of the University of California
+ */
+
+#define SZREG 8
+#define REG_S sd
+
+.global memset
+.type memset,@function
+memset:
+	move 	t0, a0  /* Preserve return value */
+
+	/* Defer to byte-oriented fill for small sizes */
+	sltiu 	a3, a2, 16
+	bnez 	a3, 4f
+
+	/*
+	 * Round to nearest XLEN-aligned address
+	 * greater than or equal to start address
+	 */
+	addi 	a3, t0, SZREG-1
+	andi 	a3, a3, ~(SZREG-1)
+	beq 	a3, t0, 2f  /* Skip if already aligned */
+	/* Handle initial misalignment */
+	sub 	a4, a3, t0
+1:
+	sb 	a1, 0(t0)
+	addi 	t0, t0, 1
+	bltu 	t0, a3, 1b
+	sub 	a2, a2, a4  /* Update count */
+
+2: 
+	andi 	a1, a1, 0xff
+	slli 	a3, a1, 8
+	or 	a1, a3, a1
+	slli 	a3, a1, 16
+	or 	a1, a3, a1
+	slli 	a3, a1, 32
+	or 	a1, a3, a1
+
+	/* Calculate end address */
+	andi 	a4, a2, ~(SZREG-1)
+	add 	a3, t0, a4
+
+	andi 	a4, a4, 31*SZREG  /* Calculate remainder */
+	beqz 	a4, 3f            /* Shortcut if no remainder */
+	neg 	a4, a4
+	addi 	a4, a4, 32*SZREG  /* Calculate initial offset */
+
+	/* Adjust start address with offset */
+	sub 	t0, t0, a4
+
+	/* Jump into loop body */
+	/* Assumes 32-bit (non-compressed) instruction lengths */
+	la 	a5, 3f
+	srli 	a4, a4, 1
+	add 	a5, a5, a4
+	jr 	a5
+3:
+	REG_S 	a1,        0(t0)
+	REG_S 	a1,    SZREG(t0)
+	REG_S 	a1,  2*SZREG(t0)
+	REG_S 	a1,  3*SZREG(t0)
+	REG_S 	a1,  4*SZREG(t0)
+	REG_S 	a1,  5*SZREG(t0)
+	REG_S 	a1,  6*SZREG(t0)
+	REG_S 	a1,  7*SZREG(t0)
+	REG_S 	a1,  8*SZREG(t0)
+	REG_S 	a1,  9*SZREG(t0)
+	REG_S 	a1, 10*SZREG(t0)
+	REG_S 	a1, 11*SZREG(t0)
+	REG_S 	a1, 12*SZREG(t0)
+	REG_S 	a1, 13*SZREG(t0)
+	REG_S 	a1, 14*SZREG(t0)
+	REG_S 	a1, 15*SZREG(t0)
+	REG_S 	a1, 16*SZREG(t0)
+	REG_S 	a1, 17*SZREG(t0)
+	REG_S 	a1, 18*SZREG(t0)
+	REG_S 	a1, 19*SZREG(t0)
+	REG_S 	a1, 20*SZREG(t0)
+	REG_S 	a1, 21*SZREG(t0)
+	REG_S 	a1, 22*SZREG(t0)
+	REG_S 	a1, 23*SZREG(t0)
+	REG_S 	a1, 24*SZREG(t0)
+	REG_S 	a1, 25*SZREG(t0)
+	REG_S 	a1, 26*SZREG(t0)
+	REG_S 	a1, 27*SZREG(t0)
+	REG_S 	a1, 28*SZREG(t0)
+	REG_S 	a1, 29*SZREG(t0)
+	REG_S 	a1, 30*SZREG(t0)
+	REG_S 	a1, 31*SZREG(t0)
+	addi 	t0, t0, 32*SZREG
+	bltu 	t0, a3, 3b
+	andi 	a2, a2, SZREG-1  /* Update count */
+
+4:
+	/* Handle trailing misalignment */
+	beqz 	a2, 6f
+	add 	a3, t0, a2
+5:
+        /* Fill head and tail with minimal branching. Each
+         * conditional ensures that all the subsequently used
+         * offsets are well-defined and in the dest region. */
+	sb 	a1, 0(t0)
+	sb 	a1, -1(a3)
+	li 	a4, 2
+	bgeu 	a4, a2, 6f
+
+	sb 	a1, 1(t0)
+	sb 	a1, 2(t0)
+	sb 	a1, -2(a3)
+	sb 	a1, -3(a3)
+	li 	a4, 6
+	bgeu 	a4, a2, 6f
+
+	sb 	a1, 3(t0)
+	sb 	a1, -4(a3)
+	li 	a4, 8
+	bgeu 	a4, a2, 6f
+
+	sb 	a1, 4(t0)
+	sb 	a1, 5(t0)
+	sb 	a1, -5(a3)
+	li 	a4, 11
+	bgeu 	a4, a2, 6f
+
+	sb 	a1, 6(t0)
+	sb 	a1, -6(a3)
+	sb 	a1, -7(a3)
+	li 	a4, 14
+	bgeu 	a4, a2, 6f
+
+	sb 	a1, 7(t0)
+6:
+	ret
-- 
2.34.1



* [musl] [PATCH 2/3] RISC-V: Optimize memcpy
  2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
@ 2023-06-07 10:07 ` zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 3/3] RISC-V: Optimize memmove zhangfei
  2 siblings, 0 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
  To: dalias, musl; +Cc: zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

This code is based on linux/arch/riscv/lib/memcpy.S, with the kernel macro definitions
removed and the code adapted for RISCV64.
The original implementation in the kernel falls back to a byte-wise copy when src and dst
are not co-aligned, which is not efficient. Therefore, the patch linked below has been
used to modify this section:

https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/

The patch above optimizes memcpy for the misaligned case: when src and dst are not
co-aligned, it loads two adjacent words from src and uses shifts to assemble a full
machine word. A rough C illustration of this technique follows.
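
The following C sketch is an illustration only and not part of the patch (the function
and variable names are mine); the patch expresses the equivalent logic in assembly:

    #include <stdint.h>
    #include <stddef.h>

    /* Rough little-endian sketch of the misaligned word-wise copy: round
     * src down to an aligned address, then build each destination word
     * from two adjacent aligned loads with complementary shifts.
     * Assumes src is not word-aligned (off != 0); the co-aligned case
     * takes the simpler word-copy path instead. */
    static void copy_misaligned_words(uint64_t *dst, const unsigned char *src,
                                      size_t nwords)
    {
        size_t off = (uintptr_t)src % sizeof(uint64_t);
        const uint64_t *s = (const uint64_t *)(src - off);
        unsigned shift = 8 * off;     /* bits to shift the low word right */
        uint64_t cur = *s++;          /* first (partially used) source word */

        for (size_t i = 0; i < nwords; i++) {
            /* Stays within the aligned words that contain source data,
             * since off != 0 means the last needed byte sits in s[nwords]. */
            uint64_t next = *s++;
            dst[i] = (cur >> shift) | (next << (64 - shift));
            cur = next;
        }
    }

As in the assembly, any trailing bytes that do not fill a whole word are left to the
byte-copy tail.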

Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
 src/string/riscv64/memcpy.S | 159 ++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 src/string/riscv64/memcpy.S

diff --git a/src/string/riscv64/memcpy.S b/src/string/riscv64/memcpy.S
new file mode 100644
index 0000000..ee59924
--- /dev/null
+++ b/src/string/riscv64/memcpy.S
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2013 Regents of the University of California
+ */
+
+#define SZREG 8
+#define REG_S sd
+#define REG_L ld
+
+.global memcpy
+.type memcpy,@function
+memcpy:
+        /* Save for return value */
+        mv      t6, a0
+
+        /*
+         * Register allocation for code below:
+         * a0 - start of uncopied dst
+         * a1 - start of uncopied src
+         * t0 - end of uncopied dst
+         */
+        add     t0, a0, a2
+
+        /*
+         * Use bytewise copy if too small.
+         *
+         * This threshold must be at least 2*SZREG to ensure at least one
+         * wordwise copy is performed. It is chosen to be 16 because it will
+         * save at least 7 iterations of bytewise copy, which pays off the
+         * fixed overhead.
+         */
+        li      a3, 16
+        bltu    a2, a3, .Lbyte_copy_tail
+
+        /*
+         * Bytewise copy first to align a0 to word boundary.
+         */
+        addi    a2, a0, SZREG-1
+        andi    a2, a2, ~(SZREG-1)
+        beq     a0, a2, 2f
+1:
+        lb      a5, 0(a1)
+        addi    a1, a1, 1
+        sb      a5, 0(a0)
+        addi    a0, a0, 1
+        bne     a0, a2, 1b
+2:
+
+        /*
+         * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+         * aligned word-wise copy. Otherwise we need to perform misaligned
+         * word-wise copy.
+         */
+        andi    a3, a1, SZREG-1
+        bnez    a3, .Lmisaligned_word_copy
+
+        /* Unrolled wordwise copy */
+        addi    t0, t0, -(16*SZREG-1)
+        bgeu    a0, t0, 2f
+1:
+        REG_L   a2,        0(a1)
+        REG_L   a3,    SZREG(a1)
+        REG_L   a4,  2*SZREG(a1)
+        REG_L   a5,  3*SZREG(a1)
+        REG_L   a6,  4*SZREG(a1)
+        REG_L   a7,  5*SZREG(a1)
+        REG_L   t1,  6*SZREG(a1)
+        REG_L   t2,  7*SZREG(a1)
+        REG_L   t3,  8*SZREG(a1)
+        REG_L   t4,  9*SZREG(a1)
+        REG_L   t5, 10*SZREG(a1)
+        REG_S   a2,        0(a0)
+        REG_S   a3,    SZREG(a0)
+        REG_S   a4,  2*SZREG(a0)
+        REG_S   a5,  3*SZREG(a0)
+        REG_S   a6,  4*SZREG(a0)
+        REG_S   a7,  5*SZREG(a0)
+        REG_S   t1,  6*SZREG(a0)
+        REG_S   t2,  7*SZREG(a0)
+        REG_S   t3,  8*SZREG(a0)
+        REG_S   t4,  9*SZREG(a0)
+        REG_S   t5, 10*SZREG(a0)
+        REG_L   a2, 11*SZREG(a1)
+        REG_L   a3, 12*SZREG(a1)
+        REG_L   a4, 13*SZREG(a1)
+        REG_L   a5, 14*SZREG(a1)
+        REG_L   a6, 15*SZREG(a1)
+        addi    a1, a1, 16*SZREG
+        REG_S   a2, 11*SZREG(a0)
+        REG_S   a3, 12*SZREG(a0)
+        REG_S   a4, 13*SZREG(a0)
+        REG_S   a5, 14*SZREG(a0)
+        REG_S   a6, 15*SZREG(a0)
+        addi    a0, a0, 16*SZREG
+        bltu    a0, t0, 1b
+2:
+        /* Post-loop increment by 16*SZREG-1 and pre-loop decrement by SZREG-1 */
+        addi    t0, t0, 15*SZREG
+
+        /* Wordwise copy */
+        bgeu    a0, t0, 2f
+1:
+        REG_L   a5, 0(a1)
+        addi    a1, a1, SZREG
+        REG_S   a5, 0(a0)
+        addi    a0, a0, SZREG
+        bltu    a0, t0, 1b
+2:
+        addi    t0, t0, SZREG-1
+
+.Lbyte_copy_tail:
+        /*
+         * Bytewise copy anything left.
+         */
+        beq     a0, t0, 2f
+1:
+        lb      a5, 0(a1)
+        addi    a1, a1, 1
+        sb      a5, 0(a0)
+        addi    a0, a0, 1
+        bne     a0, t0, 1b
+2:
+
+        mv      a0, t6
+        ret
+
+.Lmisaligned_word_copy:
+        /*
+         * Misaligned word-wise copy.
+         * For misaligned copy we still perform word-wise copy, but we need to
+         * use the value fetched from the previous iteration and do some shifts.
+         * This is safe because we wouldn't access more words than necessary.
+         */
+
+        /* Calculate shifts */
+        slli    t3, a3, 3
+        sub     t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+        /* Load the initial value and align a1 */
+        andi    a1, a1, ~(SZREG-1)
+        REG_L   a5, 0(a1)
+
+        addi    t0, t0, -(SZREG-1)
+        /* At least one iteration will be executed here, no check */
+1:
+        srl     a4, a5, t3
+        REG_L   a5, SZREG(a1)
+        addi    a1, a1, SZREG
+        sll     a2, a5, t4
+        or      a2, a2, a4
+        REG_S   a2, 0(a0)
+        addi    a0, a0, SZREG
+        bltu    a0, t0, 1b
+
+        /* Update pointers to correct value */
+        addi    t0, t0, SZREG-1
+        add     a1, a1, a3
+
+        j       .Lbyte_copy_tail
-- 
2.34.1



* [musl] [PATCH 3/3] RISC-V: Optimize memmove
  2023-06-07 10:07 [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
  2023-06-07 10:07 ` [musl] [PATCH 2/3] RISC-V: Optimize memcpy zhangfei
@ 2023-06-07 10:07 ` zhangfei
  2 siblings, 0 replies; 5+ messages in thread
From: zhangfei @ 2023-06-07 10:07 UTC (permalink / raw)
  To: dalias, musl; +Cc: zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

This code is based on linux/arch/riscv/lib/memmove.S, with the kernel macro definitions
removed and the code adapted for RISCV64.

Signed-off-by: Zhang Fei <zhangfei@nj.iscas.ac.cn>
---
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 1 file changed, 315 insertions(+)
 create mode 100644 src/string/riscv64/memmove.S

diff --git a/src/string/riscv64/memmove.S b/src/string/riscv64/memmove.S
new file mode 100644
index 0000000..41b84e3
--- /dev/null
+++ b/src/string/riscv64/memmove.S
@@ -0,0 +1,315 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
+ */
+
+#define SZREG 8
+#define REG_S sd
+#define REG_L ld
+
+.global memmove
+.type memmove,@function
+memmove:
+	/*
+	 * Returns
+	 *   a0 - dest
+	 *
+	 * Parameters
+	 *   a0 - Inclusive first byte of dest
+	 *   a1 - Inclusive first byte of src
+	 *   a2 - Length of copy n
+	 *
+	 * Because the return matches the parameter register a0,
+	 * we will not clobber or modify that register.
+	 *
+	 * Note: This currently only works on little-endian.
+	 * To port to big-endian, reverse the direction of shifts
+	 * in the 2 misaligned fixup copy loops.
+	 */
+
+	/* Return if nothing to do */
+	beq a0, a1, return_from_memmove
+	beqz a2, return_from_memmove
+
+	/*
+	 * Register Uses
+	 *      Forward Copy: a1 - Index counter of src
+	 *      Reverse Copy: a4 - Index counter of src
+	 *      Forward Copy: t3 - Index counter of dest
+	 *      Reverse Copy: t4 - Index counter of dest
+	 *   Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest
+	 *   Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest
+	 *   Both Copy Modes: t0 - Link / Temporary for load-store
+	 *   Both Copy Modes: t1 - Temporary for load-store
+	 *   Both Copy Modes: t2 - Temporary for load-store
+	 *   Both Copy Modes: a5 - dest to src alignment offset
+	 *   Both Copy Modes: a6 - Shift amount
+	 *   Both Copy Modes: a7 - Inverse Shift amount
+	 *   Both Copy Modes: a2 - Alternate breakpoint for unrolled loops
+	 */
+
+	/*
+	 * Solve for some register values now.
+	 * Byte copy does not need t5 or t6.
+	 */
+	mv   t3, a0
+	add  t4, a0, a2
+	add  a4, a1, a2
+
+	/*
+	 * Byte copy if copying less than (2 * SZREG) bytes. This can
+	 * cause problems with the bulk copy implementation and is
+	 * small enough not to bother.
+	 */
+	andi t0, a2, -(2 * SZREG)
+	beqz t0, byte_copy
+
+	/*
+	 * Now solve for t5 and t6.
+	 */
+	andi t5, t3, -SZREG
+	andi t6, t4, -SZREG
+	/*
+	 * If dest(Register t3) rounded down to the nearest naturally
+	 * aligned SZREG address, does not equal dest, then add SZREG
+	 * to find the low-bound of SZREG alignment in the dest memory
+	 * region.  Note that this could overshoot the dest memory
+	 * region if n is less than SZREG.  This is one reason why
+	 * we always byte copy if n is less than SZREG.
+	 * Otherwise, dest is already naturally aligned to SZREG.
+	 */
+	beq  t5, t3, 1f
+	addi t5, t5, SZREG
+	1:
+
+	/*
+	 * If the dest and src are co-aligned to SZREG, then there is
+	 * no need for the full rigmarole of a full misaligned fixup copy.
+	 * Instead, do a simpler co-aligned copy.
+	 */
+	xor  t0, a0, a1
+	andi t1, t0, (SZREG - 1)
+	beqz t1, coaligned_copy
+	/* Fall through to misaligned fixup copy */
+
+misaligned_fixup_copy:
+	bltu a1, a0, misaligned_fixup_copy_reverse
+
+misaligned_fixup_copy_forward:
+	jal  t0, byte_copy_until_aligned_forward
+
+	andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */
+	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+	sub  a5, a1, t3 /* Find the difference between src and dest */
+	andi a1, a1, -SZREG /* Align the src pointer */
+	addi a2, t6, SZREG /* The other breakpoint for the unrolled loop */
+
+	/*
+	 * Compute The Inverse Shift
+	 * a7 = XLEN - a6 = XLEN + -a6
+	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
+	 * Add that to XLEN.  XLEN = SZREG * 8.
+	 */
+	not  a7, a6
+	addi a7, a7, (SZREG * 8 + 1)
+
+	/*
+	 * Fix Misalignment Copy Loop - Forward
+	 * load_val0 = load_ptr[0];
+	 * do {
+	 * 	load_val1 = load_ptr[1];
+	 * 	store_ptr += 2;
+	 * 	store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7});
+	 *
+	 * 	if (store_ptr == {a2})
+	 * 		break;
+	 *
+	 * 	load_val0 = load_ptr[2];
+	 * 	load_ptr += 2;
+	 * 	store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7});
+	 *
+	 * } while (store_ptr != store_ptr_end);
+	 * store_ptr = store_ptr_end;
+	 */
+
+	REG_L t0, (0 * SZREG)(a1)
+	1:
+	REG_L t1, (1 * SZREG)(a1)
+	addi  t3, t3, (2 * SZREG)
+	srl   t0, t0, a6
+	sll   t2, t1, a7
+	or    t2, t0, t2
+	REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3)
+
+	beq   t3, a2, 2f
+
+	REG_L t0, (2 * SZREG)(a1)
+	addi  a1, a1, (2 * SZREG)
+	srl   t1, t1, a6
+	sll   t2, t0, a7
+	or    t2, t1, t2
+	REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3)
+
+	bne   t3, t6, 1b
+	2:
+	mv    t3, t6 /* Fix the dest pointer in case the loop was broken */
+
+	add  a1, t3, a5 /* Restore the src pointer */
+	j byte_copy_forward /* Copy any remaining bytes */
+
+misaligned_fixup_copy_reverse:
+	jal  t0, byte_copy_until_aligned_reverse
+
+	andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */
+	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+	sub  a5, a4, t4 /* Find the difference between src and dest */
+	andi a4, a4, -SZREG /* Align the src pointer */
+	addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop */
+
+	/*
+	 * Compute The Inverse Shift
+	 * a7 = XLEN - a6 = XLEN + -a6
+	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
+	 * Add that to XLEN.  XLEN = SZREG * 8.
+	 */
+	not  a7, a6
+	addi a7, a7, (SZREG * 8 + 1)
+
+	/*
+	 * Fix Misalignment Copy Loop - Reverse
+	 * load_val1 = load_ptr[0];
+	 * do {
+	 * 	load_val0 = load_ptr[-1];
+	 * 	store_ptr -= 2;
+	 * 	store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7});
+	 *
+	 * 	if (store_ptr == {a2})
+	 * 		break;
+	 *
+	 * 	load_val1 = load_ptr[-2];
+	 * 	load_ptr -= 2;
+	 * 	store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7});
+	 *
+	 * } while (store_ptr != store_ptr_end);
+	 * store_ptr = store_ptr_end;
+	 */
+
+	REG_L t1, ( 0 * SZREG)(a4)
+	1:
+	REG_L t0, (-1 * SZREG)(a4)
+	addi  t4, t4, (-2 * SZREG)
+	sll   t1, t1, a7
+	srl   t2, t0, a6
+	or    t2, t1, t2
+	REG_S t2, ( 1 * SZREG)(t4)
+
+	beq   t4, a2, 2f
+
+	REG_L t1, (-2 * SZREG)(a4)
+	addi  a4, a4, (-2 * SZREG)
+	sll   t0, t0, a7
+	srl   t2, t1, a6
+	or    t2, t0, t2
+	REG_S t2, ( 0 * SZREG)(t4)
+
+	bne   t4, t5, 1b
+	2:
+	mv    t4, t5 /* Fix the dest pointer in case the loop was broken */
+
+	add  a4, t4, a5 /* Restore the src pointer */
+	j byte_copy_reverse /* Copy any remaining bytes */
+
+/*
+ * Simple copy loops for SZREG co-aligned memory locations.
+ * These also make calls to do byte copies for any unaligned
+ * data at their terminations.
+ */
+coaligned_copy:
+	bltu a1, a0, coaligned_copy_reverse
+
+coaligned_copy_forward:
+	jal t0, byte_copy_until_aligned_forward
+
+	1:
+	REG_L t1, ( 0 * SZREG)(a1)
+	addi  a1, a1, SZREG
+	addi  t3, t3, SZREG
+	REG_S t1, (-1 * SZREG)(t3)
+	bne   t3, t6, 1b
+
+	j byte_copy_forward /* Copy any remaining bytes */
+
+coaligned_copy_reverse:
+	jal t0, byte_copy_until_aligned_reverse
+
+	1:
+	REG_L t1, (-1 * SZREG)(a4)
+	addi  a4, a4, -SZREG
+	addi  t4, t4, -SZREG
+	REG_S t1, ( 0 * SZREG)(t4)
+	bne   t4, t5, 1b
+
+	j byte_copy_reverse /* Copy any remaining bytes */
+
+/*
+ * These are basically sub-functions within the function.  They
+ * are used to byte copy until the dest pointer is in alignment.
+ * At which point, a bulk copy method can be used by the
+ * calling code.  These work on the same registers as the bulk
+ * copy loops.  Therefore, the register values can be picked
+ * up from where they were left and we avoid code duplication
+ * without any overhead except the call in and return jumps.
+ */
+byte_copy_until_aligned_forward:
+	beq  t3, t5, 2f
+	1:
+	lb   t1,  0(a1)
+	addi a1, a1, 1
+	addi t3, t3, 1
+	sb   t1, -1(t3)
+	bne  t3, t5, 1b
+	2:
+	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+byte_copy_until_aligned_reverse:
+	beq  t4, t6, 2f
+	1:
+	lb   t1, -1(a4)
+	addi a4, a4, -1
+	addi t4, t4, -1
+	sb   t1,  0(t4)
+	bne  t4, t6, 1b
+	2:
+	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+/*
+ * Simple byte copy loops.
+ * These will byte copy until they reach the end of data to copy.
+ * At that point, they will call to return from memmove.
+ */
+byte_copy:
+	bltu a1, a0, byte_copy_reverse
+
+byte_copy_forward:
+	beq  t3, t4, 2f
+	1:
+	lb   t1,  0(a1)
+	addi a1, a1, 1
+	addi t3, t3, 1
+	sb   t1, -1(t3)
+	bne  t3, t4, 1b
+	2:
+	ret
+
+byte_copy_reverse:
+	beq  t4, t3, 2f
+	1:
+	lb   t1, -1(a4)
+	addi a4, a4, -1
+	addi t4, t4, -1
+	sb   t1,  0(t4)
+	bne  t4, t3, 1b
+	2:
+
+return_from_memmove:
+	ret
-- 
2.34.1



* Re: [musl] [PATCH 1/3] RISC-V: Optimize memset
  2023-06-07 10:07 ` [musl] [PATCH 1/3] RISC-V: Optimize memset zhangfei
@ 2023-06-07 12:57   ` Rich Felker
  0 siblings, 0 replies; 5+ messages in thread
From: Rich Felker @ 2023-06-07 12:57 UTC (permalink / raw)
  To: zhangfei; +Cc: musl, zhangfei

On Wed, Jun 07, 2023 at 06:07:08PM +0800, zhangfei wrote:
> From: zhangfei <zhangfei@nj.iscas.ac.cn>
> 
> This code is based on linux/arch/riscv/lib/memset.S. Removed macro definition and modified
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> to support RISCV64.
> When the amount of data in the source code is less than 16 bytes or after loop tail
> processing, byte storage is used. Here we refer to musl/src/string/memset.c, and modify it
> to fill head and tail with minimal branching.
> 
> Signed-off-by: Zhang Fei<zhangfei@nj.iscas.ac.cn>
> ---
>  src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 136 insertions(+)
>  create mode 100644 src/string/riscv64/memset.S
> 
> diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
> new file mode 100644
> index 0000000..f8663d7
> --- /dev/null
> +++ b/src/string/riscv64/memset.S
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
                               ^^^^^^^^^^^^

This completely precludes any consideration for inclusion. Please do
not send license-incompatible code to the mailing list. Not only can
we not use it, but putting it in front of people actually working on
code suitable for musl makes us work extra hard to avoid taint.

You're free to link it into your own products (assuming you're
honoring your obligations under the GPL...), and doing so will get you
pretty much the entire benefit of having had this in libc.

Rich



Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
