Date: Sat, 10 Oct 2020 20:25:14 -0400
From: Rich Felker
To: Denys Vlasenko
Cc: Denys Vlasenko, musl@lists.openwall.com
Message-ID: <20201011002514.GF17637@brightrain.aerifal.cx>
In-Reply-To: <20201003223209.10307-1-vda.linux@googlemail.com>
Subject: Re: [musl] [PATCH] x86/memset: avoid performing final store twice

On Sun, Oct 04, 2020 at 12:32:09AM +0200, Denys Vlasenko wrote:
> From: Denys Vlasenko
> 
> For the not-very-short NBYTES case:
> 
> To handle tail alignment, the code performs a potentially
> misaligned word store to fill the final 8 bytes of the buffer.
> This is done even if the buffer's end is already aligned.
> 
> The code then fills the rest of the buffer, which is now a
> multiple of 8 bytes, with NBYTES / 8 aligned word stores.
> 
> However, this means that if NBYTES *was* divisible by 8,
> the last word is stored twice.
> 
> This patch decrements the byte count before dividing it by 8,
> eliminating one store in the "NBYTES is divisible by 8" case
> and changing nothing in all other cases.
> 
> CC: Rich Felker
> CC: musl@lists.openwall.com
> Signed-off-by: Denys Vlasenko
> ---
>  src/string/i386/memset.s   | 7 ++++---
>  src/string/x86_64/memset.s | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/src/string/i386/memset.s b/src/string/i386/memset.s
> index d00422c4..b1c5c2f8 100644
> --- a/src/string/i386/memset.s
> +++ b/src/string/i386/memset.s
> @@ -47,7 +47,7 @@ memset:
>  	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
>  	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
> 
> -1:	ret
> +1:	ret
> 
>  2:	movzbl 8(%esp),%eax
>  	mov %edi,12(%esp)
> @@ -57,13 +57,14 @@ memset:
>  	mov %eax,-4(%edi,%ecx)
>  	jnz 2f
> 
> -1:	shr $2, %ecx
> +1:	dec %ecx
> +	shr $2, %ecx
>  	rep
>  	stosl
>  	mov 4(%esp),%eax
>  	mov 12(%esp),%edi
>  	ret
> -
> +
>  2:	xor %edx,%edx
>  	sub %edi,%edx
>  	and $15,%edx
> diff --git a/src/string/x86_64/memset.s b/src/string/x86_64/memset.s
> index 2d3f5e52..85bb686c 100644
> --- a/src/string/x86_64/memset.s
> +++ b/src/string/x86_64/memset.s
> @@ -53,7 +53,7 @@ memset:
>  2:	test $15,%edi
>  	mov %rdi,%r8
>  	mov %rax,-8(%rdi,%rdx)
> -	mov %rdx,%rcx
> +	lea -1(%rdx),%rcx
>  	jnz 2f
> 
>  1:	shr $3,%rcx
> -- 
> 2.25.0

Does this have measurably better performance on a system you've
tested it on? It nominally performs one less store, but it also adds
an extra arithmetic operation. I doubt it makes anything worse, but
I'd like to avoid making random changes to such functions unless
there's a good reason to believe the change actually helps.

Rich
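
As a minimal illustration of the counting argument in the commit
message above, here is a C sketch. It is an assumed model, not the
musl asm: it supposes the tail is always covered by one 8-byte store
at offset n-8 and that the bulk loop then issues "count" aligned
8-byte stores from the start of the buffer.

#include <stdio.h>

/*
 * Model of the patch's counting change (illustrative assumption, not
 * the actual asm): the tail is covered by one possibly-misaligned
 * 8-byte store over [n-8, n), then the bulk loop issues "count"
 * aligned 8-byte stores from the start of the buffer.
 *
 *   old: count = n >> 3        -> bulk covers [0, 8*(n/8))
 *   new: count = (n - 1) >> 3  -> bulk covers [0, 8*((n-1)/8))
 *
 * When 8 divides n, the old bulk loop's last store is [n-8, n), which
 * the tail store already wrote; decrementing before the shift drops
 * exactly that store. When 8 does not divide n, both counts agree.
 */
int main(void)
{
	for (unsigned long n = 24; n <= 33; n++)
		printf("n=%2lu: old n>>3 = %lu, new (n-1)>>3 = %lu%s\n",
		       n, n >> 3, (n - 1) >> 3,
		       n % 8 == 0 ? "   <- one store saved" : "");
	return 0;
}

Running it shows the two counts differing by exactly one at n = 24
and n = 32 and agreeing everywhere else, matching the patch's claim
that only the "NBYTES is divisible by 8" case changes.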