From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.5 required=5.0 tests=DKIM_INVALID,DKIM_SIGNED,
	FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_LOW,
	T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 1937 invoked from network); 7 Jun 2023 10:08:33 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 7 Jun 2023 10:08:33 -0000
Received: (qmail 26549 invoked by uid 550); 7 Jun 2023 10:08:16 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 26322 invoked from network); 7 Jun 2023 10:08:14 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=163.com;
	s=s110527; h=From:Subject:Date:Message-Id:MIME-Version; bh=IWLhZ
	ZskRvN+i966+Wjwod8tI4dBx6DxuQIChK+7xzo=; b=LMod26hXHw3u9vheVZmkb
	Wa03DY4oaJFeZKujVfKr1l0+n928Nizkb0v0I1li9/eqhD8bG/Ox7nzHsZlQ1SaB
	mYasexkwe1KtJE4MFybFTYGOkV/XSdy/j1U9/40u4l8Wc0QUqEd1V43sgmf9cvpQ
	xoqv7MbWfx4U44R7LsElXQ=
From: zhangfei <zhang_fei_0403@163.com>
To: dalias@libc.org,
	musl@lists.openwall.com
Cc: zhangfei <zhangfei@nj.iscas.ac.cn>
Date: Wed,  7 Jun 2023 18:07:07 +0800
Message-Id: <20230607100710.4286-1-zhang_fei_0403@163.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-CM-TRANSID:_____wBXhbbPVoBki6btBg--.27251S2
X-Coremail-Antispam: 1Uf129KBjvJXoWxWrWkZw47KF15Xw1kWFW7urg_yoWrtw4rpr
	43Jr43Kr17tryxJw4ftan0yrn0qrWrtr1UK3ySka4rCr1vkas8XFW7ua109FyxJrW8GryS
	qw1fXF18uF45Aa7anT9S1TB71UUUUUUqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2
	9KBjDUYxBIdaVFxhVjvjDU0xZFpf9x07UYPfdUUUUU=
X-Originating-IP: [180.111.101.91]
X-CM-SenderInfo: x2kd0w5bihxsiquqjqqrwthudrp/1tbiMhaHl1WB4CtvhAAAsA
Subject: [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove

From: zhangfei <zhangfei@nj.iscas.ac.cn>

Hi,

Currently, the RISC-V port of the Linux kernel uses assembly implementations
of memset, memcpy, and memmove, as shown in the links below:

[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S

I have adapted them so that they can be compiled in musl. I also noticed
that aarch64 and x86 in musl already have assembly implementations of
these functions, so I hope these patches can be integrated into musl.

memset.S follows musl/src/string/memset.c in its handling of sizes below
8 bytes: the byte stores are arranged to fill the head and the tail with
minimal branching.
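
To illustrate the head/tail fill mentioned above, here is a minimal C sketch
of the idea. It is not the code from the patch: the function name
sketch_memset_small is hypothetical, and the plain byte loop for sizes above
8 bytes is only a placeholder for the word-wise fill a real implementation
would use.

```c
#include <stddef.h>

/* Hypothetical helper: fills n bytes at dest with c.
 * Small sizes are handled by overlapping stores from both ends, so only
 * a few size comparisons are needed instead of one branch per byte. */
void *sketch_memset_small(void *dest, int c, size_t n)
{
    unsigned char *s = dest;

    if (n == 0) return dest;
    s[0] = c;                 /* first and last byte cover n = 1..2 */
    s[n - 1] = c;
    if (n <= 2) return dest;

    s[1] = c;                 /* two more bytes from each end cover n = 3..6 */
    s[2] = c;
    s[n - 2] = c;
    s[n - 3] = c;
    if (n <= 6) return dest;

    s[3] = c;                 /* one more from each end covers n = 7..8 */
    s[n - 4] = c;
    if (n <= 8) return dest;

    /* Placeholder for the bulk fill: bytes 0..3 and n-4..n-1 are done. */
    for (size_t i = 4; i < n - 4; i++)
        s[i] = c;
    return dest;
}
```

Some stores overlap for small n (for n = 3, s[1] is written twice), which is
harmless and is what keeps the number of branches low.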

The original memcpy.S in the kernel falls back to a byte-wise copy if src
and dst are not co-aligned, which is inefficient. Therefore, I used the
patches linked below to optimize the kernel's memcpy.S.

[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/
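
One common way to avoid the byte-wise fallback when src and dst are not
co-aligned is to read aligned words from src and merge neighbouring words
with shifts before storing aligned words to dst. The C sketch below shows
that idea under a few assumptions: a little-endian target, 8-byte words, and
hypothetical names (sketch_memcpy_misaligned, load64, store64). It is not
the code from the patch or from the linked kernel patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-byte load/store helpers; memcpy keeps the C well defined, while an
 * assembly version would use single aligned load/store instructions. */
static uint64_t load64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, 8);
    return v;
}

static void store64(unsigned char *p, uint64_t v)
{
    memcpy(p, &v, 8);
}

/* Hypothetical sketch, little-endian only. */
void *sketch_memcpy_misaligned(void *restrict dest, const void *restrict src,
                               size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Copy bytes until the destination is 8-byte aligned. */
    while (n && ((uintptr_t)d & 7)) { *d++ = *s++; n--; }

    size_t k = (uintptr_t)s & 7;          /* offset of s inside its word */
    if (k == 0) {
        /* Co-aligned: plain word copy. */
        for (; n >= 8; n -= 8, d += 8, s += 8)
            store64(d, load64(s));
    } else if (n >= 16) {
        unsigned lo = 8 * (unsigned)k, hi = 64 - lo;
        /* Prime prev as if it were the aligned word containing s, but
         * built from in-bounds bytes only. */
        uint64_t prev = 0;
        for (size_t i = k; i < 8; i++)
            prev |= (uint64_t)s[i - k] << (8 * i);
        const unsigned char *a = s + (8 - k); /* next aligned source word */

        /* n >= 16 guarantees the aligned word at a lies inside the source. */
        while (n >= 16) {
            uint64_t next = load64(a);
            store64(d, (prev >> lo) | (next << hi));
            prev = next;
            a += 8; s += 8; d += 8; n -= 8;
        }
    }

    /* Remaining tail bytes. */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Each output word takes its low bytes from the upper part of the previous
source word and its high bytes from the lower part of the next one, so every
source word is loaded only once.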

memmove.S needed few modifications: it was only made independent of the
kernel's header files so that it can be compiled on its own in musl.

The test platform is a RISC-V SiFive U74. I used the code linked below
for performance testing.

[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/

I compared the performance of the C implementations in musl with that of
the assembly implementations. The test results are as follows:

memset.c in musl:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
           memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72
Large memset (bytes/ns):
           memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39

memset.S:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31

Medium memset (bytes/ns):
           memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67
Large memset (bytes/ns):
           memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39


memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91

Large memcpy (bytes/ns):
           memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56

memcpy.S:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16

Large memcpy (bytes/ns):
           memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75


memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24

memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83

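These results show that the assembly implementations of memset, memcpy, and
memmove, written with basic instructions, outperform the C implementations in
musl in most cases. For example, large memcpy at 1K improves from 1.25 to
3.49 bytes/ns, and unaligned forwards memmove at 1K improves from 0.22 to
1.74 bytes/ns. Please review the code.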

Thanks,
Zhang Fei

zhangfei (3):
  RISC-V: Optimize memset
  RISC-V: Optimize memcpy
  RISC-V: Optimize memmove

 src/string/riscv64/memset.S  | 136 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memcpy.S  | 159 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+)
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memcpy.S
 create mode 100644 src/string/riscv64/memmove.S