From: zhangfei
To: dalias@libc.org, musl@lists.openwall.com
Cc: zhangfei
Date: Wed, 7 Jun 2023 18:07:07 +0800
Message-Id: <20230607100710.4286-1-zhang_fei_0403@163.com>
Subject: [musl] [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove

Hi,

Currently, the RISC-V architecture in the kernel source code uses assembly
implementations of memset, memcpy, and memmove, as shown in the links below:

[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S

I have adapted them so that they can be compiled in musl. I also noticed
that aarch64 and x86 in musl have assembly implementations of these
functions, so I hope these patches can be integrated into musl.

memset.S follows the handling of sizes below 8 bytes in
musl/src/string/memset.c, changing the byte stores so that the head and
tail are filled with minimal branching.

The original memcpy.S in the kernel falls back to a byte-wise copy if src
and dst are not co-aligned. This approach is not efficient enough, so the
patches linked below were used to optimize the kernel's memcpy.S.

[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/

memmove.S needed few modifications. It was only made independent of the
kernel's header files so that it can be compiled separately in musl.

The test platform is a RISC-V SiFive U74. I used the code linked below for
performance testing.
[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/

I compared the performance of the C implementations in musl with the
assembly implementations. The test results are as follows:

memset.c in musl:
---------------------
Random memset (bytes/ns):
  memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
  memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72

Large memset (bytes/ns):
  memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39

memset.S:
---------------------
Random memset (bytes/ns):
  memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31

Medium memset (bytes/ns):
  memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67

Large memset (bytes/ns):
  memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39

memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
  memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18

Aligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19

Unaligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91

Large memcpy (bytes/ns):
  memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56

memcpy.S:
---------------------
Random memcpy (bytes/ns):
  memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21

Aligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27

Unaligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16

Large memcpy (bytes/ns):
  memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75

memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
  memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20

Unaligned backwards memmove (bytes/ns):
  memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24

memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
  memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81

Unaligned backwards memmove (bytes/ns):
  memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83

The basic-instruction assembly implementations of memset, memcpy, and
memmove outperform the C implementations in musl in nearly all of the
tests above. The exceptions are 8-byte memset and aligned 8-byte memcpy,
which are slightly slower, and 64K memset, which is unchanged.

Please review the code.

Thanks,
Zhang Fei

zhangfei (3):
  RISC-V: Optimize memset
  RISC-V: Optimize memcpy
  RISC-V: Optimize memmove

 src/string/riscv64/memset.S  | 136 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memcpy.S  | 159 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+)
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memcpy.S
 create mode 100644 src/string/riscv64/memmove.S
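
For reviewers who want the idea without reading the assembly, the
head-and-tail fill that memset.S takes from musl/src/string/memset.c can
be sketched in C roughly as follows. This is an illustration only, not the
code from the patch: edge_fill and edge_fill_selfcheck are hypothetical
names, and the loop for sizes above 8 bytes is a plain placeholder for the
word-sized stores a real implementation would use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the patch.
 * Sizes up to 8 bytes are filled by storing from both ends at once, so
 * three size checks replace a per-byte loop. Stores may overlap in the
 * middle, which is harmless because every store writes the same value. */
void *edge_fill(void *dest, int c, size_t n)
{
    unsigned char *s = dest;
    unsigned char b = (unsigned char)c;

    if (n == 0) return dest;
    s[0] = b;
    s[n - 1] = b;
    if (n <= 2) return dest;
    s[1] = b;
    s[2] = b;
    s[n - 2] = b;
    s[n - 3] = b;
    if (n <= 6) return dest;
    s[3] = b;
    s[n - 4] = b;
    if (n <= 8) return dest;

    /* Assumption for this sketch: the first and last 4 bytes are already
     * done, so the remaining middle is filled with a simple byte loop. */
    for (size_t i = 4; i < n - 4; i++)
        s[i] = b;
    return dest;
}

/* Compares edge_fill against the C library memset for sizes 0..16,
 * with one guard byte on each side to catch out-of-bounds stores. */
void edge_fill_selfcheck(void)
{
    for (size_t n = 0; n <= 16; n++) {
        unsigned char got[18], want[18];

        memset(got, 0xEE, sizeof got);
        memset(want, 0xEE, sizeof want);
        edge_fill(got + 1, 0x5A, n);
        memset(want + 1, 0x5A, n);
        assert(memcmp(got, want, sizeof got) == 0);
    }
}
```

Calling edge_fill_selfcheck() from a small test program is enough to
confirm that the overlapping stores never write outside the requested
range for the short sizes.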