From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Reply-To: musl@lists.openwall.com
Message-ID: <35E4CF2B08294E46B0451043F46D2C7C@H270>
From: "Stefan Kanthak"
To: "Rich Felker"
Cc: "Alexander Monakov", "Szabolcs Nagy"
References: <04BD4026EE364FF7AFBAF8C593E9A2E7@H270> <20210803202735.GA37904@port70.net> <6C4DCCC86B014B68877D73C798F54180@H270> <20210806142702.GV13220@brightrain.aerifal.cx>
In-Reply-To: <20210806142702.GV13220@brightrain.aerifal.cx>
Date: Fri, 6 Aug 2021 19:23:19 +0200
Organization: Me, myself & IT
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_8AC0_01D78AF8.83616680"
Subject: Re: [musl] [Patch] src/math/i386/remquo.s: remove conditional branch, shorter bit twiddling

This is a multi-part message in MIME format.

------=_NextPart_000_8AC0_01D78AF8.83616680
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Rich Felker wrote:

If you don't want patches for assembly modules, please state so
VERY CLEARLY in your FAQ.

> On Fri, Aug 06, 2021 at 12:17:12PM +0200, Stefan Kanthak wrote:
>> Alexander Monakov wrote:
>>
>> > On Wed, 4 Aug 2021, Stefan Kanthak wrote:
>> >> The change just follows by removing 6 LOC/instructions.-)
>> >
>> > Have you considered collecting the three bits in one go via a multiplication?
>>
>> No. My mind is not that twisted;-)
>>
>> > You can first isolate the necessary bits with 'and $0x4300, %eax', then do
>> > 'imul $0x910000, %eax, %eax' to put the required bits in EAX[31:29] in the
>> > right order, then shift right by 29. Three instructions, 14 bytes.
>>
>> Thanks, VERY NICE! How did you come up to it?
>>
>> Revised patch with shorter bit twiddling attached.
>
> The path forward for all the math asm is moving it to inline asm in C
> files, with no flow control or bit/register shuffling in the asm, only
> using asm for the single instructions. See how Alexander Monakov did
> x86_64 remquol in commit 19f870c3a68a959c7c6ef1de12086ac908920e5e.

This commit is for i386 fmod/fmodf/fmodl.
The bit twiddling used in (which I hadn't noticed yet) and the code
GCC generates for it are, however, (almost) as bad as the original
assembly code:

|    shrl    $8, %eax
|    movl    %eax, %ecx
|    movabsq $8463725162920157216, %rax
|    rolb    $4, %cl
|    andl    $60, %ecx
|    sarq    %cl, %rax
|    andl    $7, %eax

vs.

|    mov     %ah,%dl
|    shr     %dl
|    and     $1,%dl
|    mov     %ah,%al
|    shr     $5,%al
|    and     $2,%al
|    or      %al,%dl
|    mov     %ah,%al
|    shl     $2,%al
|    and     $4,%al
|    or      %al,%dl

> I haven't read the mul trick here in detail but I believe it should be
> duplicable with plain C * operator.

It is.

> I really do not want to review/merge asm changes that keep this kind
> of complex logic in asm when there's no strong motivation for it (like
> fixing an actual bug, vs just reducing size or improving speed). The
> risk to reward ratio is just not reasonable.

Final patch attached!

Stefan

------=_NextPart_000_8AC0_01D78AF8.83616680
Content-Type: application/octet-stream; name="remquo.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="remquo.patch"

--- -/src/math/x86_64/remquol.c
+++ +/src/math/x86_64/remquol.c
@@ -16,2 +16,2 @@
-	unsigned fpsr;
-	do __asm__ ("fprem1; fnstsw %%ax" : "+t"(t), "=a"(fpsr) : "u"(y));
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
@@ -23,6 +23,1 @@
-	unsigned char i = fpsr >> 8;
-	i = i>>4 | i<<4;
-	/* i[5:2] is now {b0 b2 ? b1}. Retrieve {0 b2 b1 b0} via
-	 * in-register table lookup. */
-	unsigned qbits = 0x7575313164642020 >> (i & 60);
-	qbits &= 7;
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;

--- /dev/null
+++ +/src/math/i386/remquof.c
@@ -0,0 +1,15 @@
+#include <math.h>
+
+float remquof(float x, float y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	float t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	*quo = (cx[sizeof(x) - 1]^cy[sizeof(y) - 1]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquof.s
+++ /dev/null
@@ -1,1 +0,0 @@
-# see remquo.s

--- /dev/null
+++ +/src/math/i386/remquol.c
@@ -0,0 +1,17 @@
+#include <math.h>
+
+long double remquol(long double x, long double y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	long double t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	/* [sizeof(long double) - 1] not usable here due to
+	   GCC's braindead handling of long double alias tbyte */
+	*quo = (cx[9]^cy[9]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquol.s
+++ /dev/null
@@ -1,1 +0,0 @@
-# see remquo.s

--- /dev/null
+++ +/src/math/i386/remquo.c
@@ -0,0 +1,15 @@
+#include <math.h>
+
+double remquo(double x, double y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	double t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	*quo = (cx[sizeof(x) - 1]^cy[sizeof(y) - 1]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquo.s
+++ /dev/null
@@ -1,50 +0,0 @@
-.global remquof
-.type remquof,@function
-remquof:
-	mov 12(%esp),%ecx
-	flds 8(%esp)
-	flds 4(%esp)
-	mov 11(%esp),%dh
-	xor 7(%esp),%dh
-	jmp 1f
-
-.global remquol
-.type remquol,@function
-remquol:
-	mov 28(%esp),%ecx
-	fldt 16(%esp)
-	fldt 4(%esp)
-	mov 25(%esp),%dh
-	xor 13(%esp),%dh
-	jmp 1f
-
-.global remquo
-.type remquo,@function
-remquo:
-	mov 20(%esp),%ecx
-	fldl 12(%esp)
-	fldl 4(%esp)
-	mov 19(%esp),%dh
-	xor 11(%esp),%dh
-1:	fprem1
-	fnstsw %ax
-	sahf
-	jp 1b
-	fstp %st(1)
-	mov %ah,%dl
-	shr %dl
-	and $1,%dl
-	mov %ah,%al
-	shr $5,%al
-	and $2,%al
-	or %al,%dl
-	mov %ah,%al
-	shl $2,%al
-	and $4,%al
-	or %al,%dl
-	test %dh,%dh
-	jns 1f
-	neg %dl
-1:	movsbl %dl,%edx
-	mov %edx,(%ecx)
-	ret
------=_NextPart_000_8AC0_01D78AF8.83616680--
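
The multiplication trick from the thread can be checked in plain C. The sketch below is not part of the patch. It assumes the usual x87 status word layout (C0 = bit 8, C1 = bit 9, C3 = bit 14) and that fprem1 reports the low quotient bits as Q2 = C0, Q1 = C3, Q0 = C1, which is what the removed remquo.s extracts with its three shift/and/or groups. The names qbits_ref, qbits_mul and count_mismatches are made up for this illustration.

```c
#include <stdint.h>

/* Reference extraction, one flag at a time (mirrors the old remquo.s):
 * result bit 0 = C1 (status bit 9), bit 1 = C3 (status bit 14),
 * bit 2 = C0 (status bit 8). */
static unsigned qbits_ref(uint16_t fpsr)
{
	unsigned c0 = (fpsr >> 8) & 1;
	unsigned c1 = (fpsr >> 9) & 1;
	unsigned c3 = (fpsr >> 14) & 1;
	return c0 << 2 | c3 << 1 | c1;
}

/* Multiplication variant: 0x910000 = 1<<23 | 1<<20 | 1<<16.
 * Modulo 2^32 the surviving partial products are
 *   bit  8 (C0) -> bits 31, 28, 24
 *   bit  9 (C1) -> bits 29, 25      (9+23 overflows)
 *   bit 14 (C3) -> bit  30          (14+20 and 14+23 overflow)
 * All six positions are distinct, so no carries occur and
 * bits 31:29 end up holding {C0, C3, C1}. */
static unsigned qbits_mul(uint16_t fpsr)
{
	uint32_t masked = fpsr & 0x4300u;
	return (uint32_t)(masked * UINT32_C(0x910000)) >> 29;
}

/* Exhaustive comparison over every 16-bit status word value;
 * returns the number of inputs on which the two variants differ. */
static unsigned count_mismatches(void)
{
	unsigned bad = 0;
	for (uint32_t v = 0; v <= 0xFFFFu; v++)
		if (qbits_ref((uint16_t)v) != qbits_mul((uint16_t)v))
			bad++;
	return bad;
}
```

Calling count_mismatches() from a small main() and checking for zero is enough to convince oneself that the one-line expression is a drop-in replacement for the table lookup and for the old shift/and/or sequence, at least under the bit layout assumed above.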