mailing list of musl libc
* [musl] [PATCH] x86/memset: avoid performing final store twice
From: Denys Vlasenko @ 2020-10-03 22:32 UTC (permalink / raw)
  To: Rich Felker; +Cc: Denys Vlasenko, musl

From: Denys Vlasenko <dvlasenk@redhat.com>

For the case of not-very-short NBYTES:

To handle tail alignment, the code performs a potentially
misaligned word store to fill the final word of the buffer
(8 bytes on x86_64, 4 on i386). This is done even if the
buffer's end is already aligned.

Eventually the code fills the rest of the buffer, now a
multiple of the word size, with NBYTES / wordsize aligned
word stores.

However, this means that if NBYTES *was* divisible by the
word size, the last word is stored twice.

This patch decrements the byte count before dividing it by the
word size, performing one less store in the "NBYTES is divisible
by the word size" case and changing nothing in the other cases.
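
In C terms, the change amounts to the following (a sketch only,
assuming an 8-byte word and an aligned start; this is not musl's
actual implementation):

#include <stdint.h>
#include <string.h>

/* Requires n >= 8; memcpy stands in for the word stores. */
static void fill_words(unsigned char *s, uint64_t word, size_t n)
{
	/* Unconditional, possibly misaligned store covering the
	 * final 8 bytes, so the loop below need not reach the end. */
	memcpy(s + n - 8, &word, 8);

	/* Before this patch: n / 8 word stores. If n is a multiple
	 * of 8, the last iteration rewrites the 8 bytes stored above.
	 * After: (n - 1) / 8 stores, which drops exactly that
	 * duplicate store and is identical for any other n. */
	for (size_t i = 0; i < (n - 1) / 8; i++)
		memcpy(s + 8 * i, &word, 8);
}

For example, with NBYTES = 32 the old count issues four aligned
stores and rewrites bytes 24..31, which the tail store already
covered; with NBYTES = 33 both counts issue four stores and the
behavior is unchanged.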

CC: Rich Felker <dalias@libc.org>
CC: musl@lists.openwall.com
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
---
 src/string/i386/memset.s   | 7 ++++---
 src/string/x86_64/memset.s | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/string/i386/memset.s b/src/string/i386/memset.s
index d00422c4..b1c5c2f8 100644
--- a/src/string/i386/memset.s
+++ b/src/string/i386/memset.s
@@ -47,7 +47,7 @@ memset:
 	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
 	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
 
-1:	ret 	
+1:	ret
 
 2:	movzbl 8(%esp),%eax
 	mov %edi,12(%esp)
@@ -57,13 +57,14 @@ memset:
 	mov %eax,-4(%edi,%ecx)
 	jnz 2f
 
-1:	shr $2, %ecx
+1:	dec %ecx
+	shr $2, %ecx
 	rep
 	stosl
 	mov 4(%esp),%eax
 	mov 12(%esp),%edi
 	ret
-	
+
 2:	xor %edx,%edx
 	sub %edi,%edx
 	and $15,%edx
diff --git a/src/string/x86_64/memset.s b/src/string/x86_64/memset.s
index 2d3f5e52..85bb686c 100644
--- a/src/string/x86_64/memset.s
+++ b/src/string/x86_64/memset.s
@@ -53,7 +53,7 @@ memset:
 2:	test $15,%edi
 	mov %rdi,%r8
 	mov %rax,-8(%rdi,%rdx)
-	mov %rdx,%rcx
+	lea -1(%rdx),%rcx
 	jnz 2f
 
 1:	shr $3,%rcx
-- 
2.25.0



* Re: [musl] [PATCH] x86/memset: avoid performing final store twice
From: Rich Felker @ 2020-10-11  0:25 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Denys Vlasenko, musl

On Sun, Oct 04, 2020 at 12:32:09AM +0200, Denys Vlasenko wrote:
> From: Denys Vlasenko <dvlasenk@redhat.com>
> 
> For the case of not-very-short NBYTES:
> 
> To handle tail alignment, the code performs a potentially
> misaligned word store to fill the final word of the buffer
> (8 bytes on x86_64, 4 on i386). This is done even if the
> buffer's end is already aligned.
> 
> Eventually the code fills the rest of the buffer, now a
> multiple of the word size, with NBYTES / wordsize aligned
> word stores.
> 
> However, this means that if NBYTES *was* divisible by the
> word size, the last word is stored twice.
> 
> This patch decrements the byte count before dividing it by the
> word size, performing one less store in the "NBYTES is divisible
> by the word size" case and changing nothing in the other cases.
> 
> CC: Rich Felker <dalias@libc.org>
> CC: musl@lists.openwall.com
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> ---
>  src/string/i386/memset.s   | 7 ++++---
>  src/string/x86_64/memset.s | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/src/string/i386/memset.s b/src/string/i386/memset.s
> index d00422c4..b1c5c2f8 100644
> --- a/src/string/i386/memset.s
> +++ b/src/string/i386/memset.s
> @@ -47,7 +47,7 @@ memset:
>  	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
>  	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
>  
> -1:	ret 	
> +1:	ret
>  
>  2:	movzbl 8(%esp),%eax
>  	mov %edi,12(%esp)
> @@ -57,13 +57,14 @@ memset:
>  	mov %eax,-4(%edi,%ecx)
>  	jnz 2f
>  
> -1:	shr $2, %ecx
> +1:	dec %ecx
> +	shr $2, %ecx
>  	rep
>  	stosl
>  	mov 4(%esp),%eax
>  	mov 12(%esp),%edi
>  	ret
> -	
> +
>  2:	xor %edx,%edx
>  	sub %edi,%edx
>  	and $15,%edx
> diff --git a/src/string/x86_64/memset.s b/src/string/x86_64/memset.s
> index 2d3f5e52..85bb686c 100644
> --- a/src/string/x86_64/memset.s
> +++ b/src/string/x86_64/memset.s
> @@ -53,7 +53,7 @@ memset:
>  2:	test $15,%edi
>  	mov %rdi,%r8
>  	mov %rax,-8(%rdi,%rdx)
> -	mov %rdx,%rcx
> +	lea -1(%rdx),%rcx
>  	jnz 2f
>  
>  1:	shr $3,%rcx
> -- 
> 2.25.0

Does this have measurably better performance on a system you've tested
it on? Indeed it nominally stores less, but it also performs an extra
arithmetic operation. I doubt it makes anything worse but I'd like to
avoid making random changes to such functions unless there's a good
reason to believe it actually helps.
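
If you do measure, even a crude harness would be informative;
something like this (untested sketch; sizes and iteration count
are arbitrary, and it should be built with -fno-builtin so the
calls actually reach libc's memset):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
	static unsigned char buf[4096];
	struct timespec t0, t1;

	/* Sizes divisible by 8 are the case the patch affects. */
	for (size_t n = 64; n <= 4096; n *= 2) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long i = 0; i < 10000000; i++) {
			memset(buf, (int)i & 0xff, n);
			/* Keep the stores from being optimized out. */
			__asm__ volatile ("" : : "r"(buf) : "memory");
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		double ns = (t1.tv_sec - t0.tv_sec) * 1e9
		          + (t1.tv_nsec - t0.tv_nsec);
		printf("n=%4zu: %.2f ns/call\n", n, ns / 1e7);
	}
	return 0;
}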

Rich


* Re: [musl] [PATCH] x86/memset: avoid performing final store twice
From: Denys Vlasenko @ 2020-10-12 12:18 UTC (permalink / raw)
  To: Rich Felker, Denys Vlasenko; +Cc: musl

On 10/11/20 2:25 AM, Rich Felker wrote:
> On Sun, Oct 04, 2020 at 12:32:09AM +0200, Denys Vlasenko wrote:
>> From: Denys Vlasenko <dvlasenk@redhat.com>
>>
>> For the case of not-very-short NBYTES:
>>
>> To handle tail alignment, the code performs a potentially
>> misaligned word store to fill the final word of the buffer
>> (8 bytes on x86_64, 4 on i386). This is done even if the
>> buffer's end is already aligned.
>>
>> Eventually the code fills the rest of the buffer, now a
>> multiple of the word size, with NBYTES / wordsize aligned
>> word stores.
>>
>> However, this means that if NBYTES *was* divisible by the
>> word size, the last word is stored twice.
>>
>> This patch decrements the byte count before dividing it by the
>> word size, performing one less store in the "NBYTES is divisible
>> by the word size" case and changing nothing in the other cases.
>>
...
>> --- a/src/string/x86_64/memset.s
>> +++ b/src/string/x86_64/memset.s
>> @@ -53,7 +53,7 @@ memset:
>>   2:	test $15,%edi
>>   	mov %rdi,%r8
>>   	mov %rax,-8(%rdi,%rdx)
>> -	mov %rdx,%rcx
>> +	lea -1(%rdx),%rcx
>>   	jnz 2f
>>   
>>   1:	shr $3,%rcx
>> -- 
>> 2.25.0
> 
> Does this have measurably better performance on a system you've tested
> it on?

I did not test performance; I expect any difference will be
hardly detectable.

