From: zhangfei
To: dalias@libc.org, musl@lists.openwall.com
Cc: zhangfei
Date: Wed, 7 Jun 2023 18:07:08 +0800
Message-Id: <20230607100710.4286-2-zhang_fei_0403@163.com>
In-Reply-To: <20230607100710.4286-1-zhang_fei_0403@163.com>
References: <20230607100710.4286-1-zhang_fei_0403@163.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [musl] [PATCH 1/3] RISC-V: Optimize memset

From: zhangfei

This code is based on linux/arch/riscv/lib/memset.S. The kernel macro
definitions are removed and the code is adapted for RISCV64. When the
length is less than 16 bytes, or for the tail left over after the
unrolled store loop, byte stores are used. Following
musl/src/string/memset.c, the head and tail are filled with minimal
branching.
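As a reading aid, not part of the patch: the byte-store path at labels
4 to 6 below can be modeled in C after the musl/src/string/memset.c
pattern referenced above. The standalone helper and the name byte_fill
are illustrative assumptions only; the patch does this inline for
n < 16 and for the remainder left by the store loop.

#include <stddef.h>

/* Sketch: fill head and tail with minimal branching, n < 16.
 * Each test guarantees every later offset lies inside [s, s+n). */
static void byte_fill(unsigned char *s, unsigned char c, size_t n)
{
	if (!n) return;
	s[0] = c; s[n-1] = c;              /* complete for n <= 2 */
	if (n <= 2) return;
	s[1] = c; s[2] = c;
	s[n-2] = c; s[n-3] = c;            /* complete for n <= 6 */
	if (n <= 6) return;
	s[3] = c; s[n-4] = c;              /* complete for n <= 8 */
	if (n <= 8) return;
	s[4] = c; s[5] = c; s[n-5] = c;    /* complete for n <= 11 */
	if (n <= 11) return;
	s[6] = c; s[n-6] = c; s[n-7] = c;  /* complete for n <= 14 */
	if (n <= 14) return;
	s[7] = c;                          /* n == 15 */
}

The cut-off constants (2, 6, 8, 11, 14) correspond one to one to the
li/bgeu pairs between labels 5 and 6 in the assembly.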
Signed-off-by: Zhang Fei
---
 src/string/riscv64/memset.S | 136 ++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 src/string/riscv64/memset.S

diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 0000000..f8663d7
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2013 Regents of the University of California
+ */
+
+#define SZREG 8
+#define REG_S sd
+
+.global memset
+.type memset,@function
+memset:
+	move t0, a0  /* Preserve return value */
+
+	/* Defer to byte-oriented fill for small sizes */
+	sltiu a3, a2, 16
+	bnez a3, 4f
+
+	/*
+	 * Round to nearest XLEN-aligned address
+	 * greater than or equal to start address
+	 */
+	addi a3, t0, SZREG-1
+	andi a3, a3, ~(SZREG-1)
+	beq a3, t0, 2f  /* Skip if already aligned */
+	/* Handle initial misalignment */
+	sub a4, a3, t0
+1:
+	sb a1, 0(t0)
+	addi t0, t0, 1
+	bltu t0, a3, 1b
+	sub a2, a2, a4  /* Update count */
+
+2:
+	andi a1, a1, 0xff
+	slli a3, a1, 8
+	or a1, a3, a1
+	slli a3, a1, 16
+	or a1, a3, a1
+	slli a3, a1, 32
+	or a1, a3, a1
+
+	/* Calculate end address */
+	andi a4, a2, ~(SZREG-1)
+	add a3, t0, a4
+
+	andi a4, a4, 31*SZREG  /* Calculate remainder */
+	beqz a4, 3f            /* Shortcut if no remainder */
+	neg a4, a4
+	addi a4, a4, 32*SZREG  /* Calculate initial offset */
+
+	/* Adjust start address with offset */
+	sub t0, t0, a4
+
+	/* Jump into loop body */
+	/* Assumes 32-bit instruction lengths */
+	la a5, 3f
+	srli a4, a4, 1
+	add a5, a5, a4
+	jr a5
+3:
+	REG_S a1, 0(t0)
+	REG_S a1, SZREG(t0)
+	REG_S a1, 2*SZREG(t0)
+	REG_S a1, 3*SZREG(t0)
+	REG_S a1, 4*SZREG(t0)
+	REG_S a1, 5*SZREG(t0)
+	REG_S a1, 6*SZREG(t0)
+	REG_S a1, 7*SZREG(t0)
+	REG_S a1, 8*SZREG(t0)
+	REG_S a1, 9*SZREG(t0)
+	REG_S a1, 10*SZREG(t0)
+	REG_S a1, 11*SZREG(t0)
+	REG_S a1, 12*SZREG(t0)
+	REG_S a1, 13*SZREG(t0)
+	REG_S a1, 14*SZREG(t0)
+	REG_S a1, 15*SZREG(t0)
+	REG_S a1, 16*SZREG(t0)
+	REG_S a1, 17*SZREG(t0)
+	REG_S a1, 18*SZREG(t0)
+	REG_S a1, 19*SZREG(t0)
+	REG_S a1, 20*SZREG(t0)
+	REG_S a1, 21*SZREG(t0)
+	REG_S a1, 22*SZREG(t0)
+	REG_S a1, 23*SZREG(t0)
+	REG_S a1, 24*SZREG(t0)
+	REG_S a1, 25*SZREG(t0)
+	REG_S a1, 26*SZREG(t0)
+	REG_S a1, 27*SZREG(t0)
+	REG_S a1, 28*SZREG(t0)
+	REG_S a1, 29*SZREG(t0)
+	REG_S a1, 30*SZREG(t0)
+	REG_S a1, 31*SZREG(t0)
+	addi t0, t0, 32*SZREG
+	bltu t0, a3, 3b
+	andi a2, a2, SZREG-1  /* Update count */
+
+4:
+	/* Handle trailing misalignment */
+	beqz a2, 6f
+	add a3, t0, a2
+5:
+	/* Fill head and tail with minimal branching. Each
+	 * conditional ensures that all the subsequently used
+	 * offsets are well-defined and in the dest region. */
+	sb a1, 0(t0)
+	sb a1, -1(a3)
+	li a4, 2
+	bgeu a4, a2, 6f
+
+	sb a1, 1(t0)
+	sb a1, 2(t0)
+	sb a1, -2(a3)
+	sb a1, -3(a3)
+	li a4, 6
+	bgeu a4, a2, 6f
+
+	sb a1, 3(t0)
+	sb a1, -4(a3)
+	li a4, 8
+	bgeu a4, a2, 6f
+
+	sb a1, 4(t0)
+	sb a1, 5(t0)
+	sb a1, -5(a3)
+	li a4, 11
+	bgeu a4, a2, 6f
+
+	sb a1, 6(t0)
+	sb a1, -6(a3)
+	sb a1, -7(a3)
+	li a4, 14
+	bgeu a4, a2, 6f
+
+	sb a1, 7(t0)
+6:
+	ret
-- 
2.34.1