From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 28557 invoked from network); 6 Aug 2021 17:30:56 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 6 Aug 2021 17:30:56 -0000
Received: (qmail 16262 invoked by uid 550); 6 Aug 2021 17:30:54 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 16244 invoked from network); 6 Aug 2021 17:30:53 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nexgo.de;
	s=vfde-smtpout-mb-15sep; t=1628271034;
	bh=Tl3LQNR5157rLnYwudtVHcLieMPopMNdJzUSK2PWf4M=;
	h=From:To:Cc:References:In-Reply-To:Subject:Date;
	b=IXPl07kyu0bzPX64Ph8Wt7qafe+MJIbFs97XUqLWhsUaW2oP/CZm6H6Sx5s4uxKm/
	 Z0lNkursK19Gsdo6l/AtTS7cSscMX6T/xRVpYib8F4AezuH+Lh+sfdQGKYZhK2/crY
	 Osxcw5+D21BbCkxiTM9eDHh+xs/djS6PpBVfRZ8A=
Message-ID: <35E4CF2B08294E46B0451043F46D2C7C@H270>
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
To: "Rich Felker" <dalias@libc.org>
Cc: "Alexander Monakov" <amonakov@ispras.ru>,
	"Szabolcs Nagy" <nsz@port70.net>,
	<musl@lists.openwall.com>
References: <04BD4026EE364FF7AFBAF8C593E9A2E7@H270> <20210803202735.GA37904@port70.net> <DFEFCEFB42FD4CBB9CD45CD321B57F1A@H270> <alpine.LNX.2.20.13.2108051626420.2536@monopod.intra.ispras.ru> <6C4DCCC86B014B68877D73C798F54180@H270> <20210806142702.GV13220@brightrain.aerifal.cx>
In-Reply-To: <20210806142702.GV13220@brightrain.aerifal.cx>
Date: Fri, 6 Aug 2021 19:23:19 +0200
Organization: Me, myself & IT
MIME-Version: 1.0
Content-Type: multipart/mixed;
	boundary="----=_NextPart_000_8AC0_01D78AF8.83616680"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Windows Mail 6.0.6002.18197
X-MimeOLE: Produced By Microsoft MimeOLE V6.1.7601.24158
Subject: Re: [musl] [Patch] src/math/i386/remquo.s: remove conditional branch, shorter bit twiddling

This is a multi-part message in MIME format.

------=_NextPart_000_8AC0_01D78AF8.83616680
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Rich Felker <dalias@libc.org> wrote:

If you don't want patches for assembly modules, please state so VERY
CLEARLY in your FAQ.


> On Fri, Aug 06, 2021 at 12:17:12PM +0200, Stefan Kanthak wrote:
>> Alexander Monakov <amonakov@ispras.ru> wrote:
>> 
>> > On Wed, 4 Aug 2021, Stefan Kanthak wrote:
>> >> The change just follows by removing 6 LOC/instructions.-)
>> > 
>> > Have you considered collecting the three bits in one go via a multiplication?
>> 
>> No. My mind is not that twisted;-)
>> 
>> > You can first isolate the necessary bits with 'and $0x4300, %eax', then do
>> > 'imul $0x910000, %eax, %eax' to put the required bits in EAX[31:29] in the
>> > right order, then shift right by 29. Three instructions, 14 bytes.
>> 
>> Thanks, VERY NICE! How did you come up to it?
>> 
>> Revised patch with shorter bit twiddling attached.
> 
> The path forward for all the math asm is moving it to inline asm in C
> files, with no flow control or bit/register shuffling in the asm, only
> using asm for the single instructions. See how Alexander Monakov did
> x86_64 remquol in commit 19f870c3a68a959c7c6ef1de12086ac908920e5e.

This commit is for i386 fmod/fmodf/fmodl. The bit twiddling used in
<https://git.musl-libc.org/cgit/musl/plain/src/math/x86_64/remquol.c>
(which I hadn't noticed until now), and the code GCC generates for it,
are (almost) as bad as the original assembly code:

|        shrl    $8, %eax
|        movl    %eax, %ecx
|        movabsq $8463725162920157216, %rax
|        rolb    $4, %cl
|        andl    $60, %ecx
|        sarq    %cl, %rax
|        andl    $7, %eax

vs.

|        mov %ah,%dl
|        shr %dl
|        and $1,%dl
|        mov %ah,%al
|        shr $5,%al
|        and $2,%al
|        or %al,%dl
|        mov %ah,%al
|        shl $2,%al
|        and $4,%al
|        or %al,%dl
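
For anyone comparing the two listings: after fprem1 the three low
quotient bits sit in the x87 status word as C0 (bit 8) = Q2, C3 (bit 14)
= Q1 and C1 (bit 9) = Q0. The following is only a sketch of what both
sequences compute; the function names qbits_ref, qbits_table and
check_all_status_words are made up for illustration and are not part of
musl. qbits_table follows the rotate-and-lookup lines of the x86_64
remquol.c, and qbits_ref picks the bits out one at a time like the old
assembly does.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: pick C0, C3 and C1 out of the status word one at a time. */
static unsigned qbits_ref(unsigned fpsr)
{
	unsigned b2 = fpsr >> 8 & 1;	/* C0 -> quotient bit 2 */
	unsigned b1 = fpsr >> 14 & 1;	/* C3 -> quotient bit 1 */
	unsigned b0 = fpsr >> 9 & 1;	/* C1 -> quotient bit 0 */
	return b2 << 2 | b1 << 1 | b0;
}

/* Rotate the high byte by 4, then use bits 5:2 as the shift count into
 * a 64-bit constant that holds sixteen 4-bit table entries. */
static unsigned qbits_table(unsigned fpsr)
{
	unsigned char i = fpsr >> 8;
	i = i >> 4 | i << 4;
	return (unsigned)(UINT64_C(0x7575313164642020) >> (i & 60)) & 7;
}

/* Exhaustive comparison over every possible 16-bit status word. */
static void check_all_status_words(void)
{
	for (unsigned fpsr = 0; fpsr <= 0xFFFF; fpsr++)
		assert(qbits_table(fpsr) == qbits_ref(fpsr));
}
```

Calling check_all_status_words() from a small main() is enough to
confirm that the two variants agree on all 65536 inputs.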

> I haven't read the mul trick here in detail but I believe it should be
> duplicable with plain C * operator.

It is.
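
A minimal sketch in plain C, assuming a 32-bit unsigned int (as on i386
and x86_64); the function names are hypothetical and only serve the
illustration. The mask 0x4300 keeps C3 (bit 14), C1 (bit 9) and C0
(bit 8). The multiplier 0x910000 has bits 23, 20 and 16 set, so the
product places C0 at bit 31, C3 at bit 30 and C1 at bit 29. No two
partial products land on the same bit, so there are no carries, and
everything at or above bit 32 is discarded by the unsigned wraparound.

```c
#include <assert.h>

/* Reference: pick C0, C3 and C1 out of the status word one at a time. */
static unsigned qbits_ref(unsigned fpsr)
{
	unsigned b2 = fpsr >> 8 & 1;	/* C0 -> quotient bit 2 */
	unsigned b1 = fpsr >> 14 & 1;	/* C3 -> quotient bit 1 */
	unsigned b0 = fpsr >> 9 & 1;	/* C1 -> quotient bit 0 */
	return b2 << 2 | b1 << 1 | b0;
}

/* Multiply trick: gather the three bits into bits 31:29 in one go.
 * Assumes unsigned int is exactly 32 bits wide. */
static unsigned qbits_mul(unsigned fpsr)
{
	return (fpsr & 0x4300) * 0x910000u >> 29;
}

/* Exhaustive comparison over every possible 16-bit status word. */
static void check_all_status_words(void)
{
	for (unsigned fpsr = 0; fpsr <= 0xFFFF; fpsr++)
		assert(qbits_mul(fpsr) == qbits_ref(fpsr));
}
```

Calling check_all_status_words() from a small main() checks the
multiply against the bit-by-bit version for all 65536 inputs.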

> I really do not want to review/merge asm changes that keep this kind
> of complex logic in asm when there's no strong motivation for it (like
> fixing an actual bug, vs just reducing size or improving speed). The
> risk to reward ratio is just not reasonable.

Final patch attached!

Stefan
------=_NextPart_000_8AC0_01D78AF8.83616680
Content-Type: application/octet-stream;
	name="remquo.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="remquo.patch"

--- -/src/math/x86_64/remquol.c
+++ +/src/math/x86_64/remquol.c
@@ -16,2 +16,2 @@
-	unsigned fpsr;
-	do __asm__ ("fprem1; fnstsw %%ax" : "+t"(t), "=a"(fpsr) : "u"(y));
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
@@ -23,6 +23,1 @@
-	unsigned char i = fpsr >> 8;
-	i = i>>4 | i<<4;
-	/* i[5:2] is now {b0 b2 ? b1}. Retrieve {0 b2 b1 b0} via
-	 * in-register table lookup. */
-	unsigned qbits = 0x7575313164642020 >> (i & 60);
-	qbits &= 7;
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;

--- /dev/null
+++ +/src/math/i386/remquof.c
@@ -0,0 +1,15 @@
+#include <math.h>
+
+float remquof(float x, float y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	float t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	*quo = (cx[sizeof(x) - 1]^cy[sizeof(y) - 1]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquof.s
+++ /dev/null
@@ -1,1 +0,0 @@
-# see remquo.s

--- /dev/null
+++ +/src/math/i386/remquol.c
@@ -0,0 +1,17 @@
+#include <math.h>
+
+long double remquol(long double x, long double y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	long double t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	/* [sizeof(long double) - 1] not usable here due to
+	   GCC's braindead handling of long double alias tbyte */
+	*quo = (cx[9]^cy[9]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquol.s
+++ /dev/null
@@ -1,1 +0,0 @@
-# see remquo.s

--- /dev/null
+++ +/src/math/i386/remquo.c
@@ -0,0 +1,14 @@
+#include <math.h>
+
+double remquo(double x, double y, int *quo)
+{
+	/* see ../x86_64/remquol.c */
+	signed char *cx = (void *)&x, *cy = (void *)&y;
+	__asm__ ("" :: "X"(cx), "X"(cy));
+	double t = x;
+	unsigned short fpsr;
+	do __asm__ ("fprem1; fnstsw %0" : "=a"(fpsr), "+t"(t) : "u"(y));
+	while (fpsr & 0x400);
+	unsigned qbits = (fpsr & 0x4300) * 0x910000u >> 29;
+	*quo = (cx[sizeof(x) - 1]^cy[sizeof(y) - 1]) < 0 ? -qbits : qbits;
+	return t;
+}

--- -/src/math/i386/remquo.s
+++ /dev/null
@@ -1,50 +0,0 @@
-.global remquof
-.type remquof,@function
-remquof:
-	mov 12(%esp),%ecx
-	flds 8(%esp)
-	flds 4(%esp)
-	mov 11(%esp),%dh
-	xor 7(%esp),%dh
-	jmp 1f
-
-.global remquol
-.type remquol,@function
-remquol:
-	mov 28(%esp),%ecx
-	fldt 16(%esp)
-	fldt 4(%esp)
-	mov 25(%esp),%dh
-	xor 13(%esp),%dh
-	jmp 1f
-
-.global remquo
-.type remquo,@function
-remquo:
-	mov 20(%esp),%ecx
-	fldl 12(%esp)
-	fldl 4(%esp)
-	mov 19(%esp),%dh
-	xor 11(%esp),%dh
-1:	fprem1
-	fnstsw %ax
-	sahf
-	jp 1b
-	fstp %st(1)
-	mov %ah,%dl
-	shr %dl
-	and $1,%dl
-	mov %ah,%al
-	shr $5,%al
-	and $2,%al
-	or %al,%dl
-	mov %ah,%al
-	shl $2,%al
-	and $4,%al
-	or %al,%dl
-	test %dh,%dh
-	jns 1f
-	neg %dl
-1:	movsbl %dl,%edx
-	mov %edx,(%ecx)
-	ret

------=_NextPart_000_8AC0_01D78AF8.83616680--