From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom <quanstro@quanstro.net>
Date: Tue, 23 Feb 2016 07:27:42 -0800
To: 9fans@9fans.net
Message-ID: <45fced5cac8155366565f0195c0b47b9@lilly.quanstro.net>
In-Reply-To: <71A3F6B7-CC61-468D-B8B2-3D46AB92483D@gmail.com>
References: <e115154c5bab8971d7b88f64ba5d4402@proxima.alt.za>
	<71A3F6B7-CC61-468D-B8B2-3D46AB92483D@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] Go: FP in note handler
Topicbox-Message-UUID: 88ee7e64-ead9-11e9-9d60-3106f5b1d025

On Tue Feb 23 02:36:41 PST 2016, kennylevinsen@gmail.com wrote:
> Ah, no - it is not a system-wide adjustment, but adjustment of the
> plan9 specific runtime.sighandler implementation and everything called
> by it directly. Notes that don't exit the process are queued and should
> run outside the actual note handler.
>
> I think the "magic" code will be isolated, and might fend off
> accidental future additions of floating point registers. The magic-ness
> also only revolves around avoiding duffzero and duffcopy in some way. I
> also think that removing conditionals in the compiler will be a
> positive thing.
>
> I still do not know the feasibility of my plan, whether it is possible
> to do cleanly, or possible at all. Maybe someone smarter than me with
> knowledge on the matter could chime in and call me an idiot?
>
> Avoiding duffcopy should be easy with a simple memmove implementation.
> If done right, we can also remove the plan9 specific runtime.memmove and
> only use the slow memmove in sighandler. (The global runtime.memmove is
> implemented using MOVUPS just like duffcopy. Duffcopy is used for block
> copies by the compiler in some cases, although I must admit to not
> knowing all the cases yet.)
>
> Avoiding duffzero without compiler assistance is a bit more tricky -
> global variables, stack on assembly functions, something like that.

fwiw, on modern amd64 machines, using the xmm and ymm registers has a
benefit only in a narrow range of sizes (384-511 bytes) and a subset of
(mis-)alignments that i've forgotten, at least for the exact test setup
i used on 3-4 different µarches.  intel claims rep; movs is the
(architecturally) fastest way to go.

i am not sure any of this makes much difference, as it's hard to know
what a real-world memory access pattern looks like, and that seems to
dominate all but gigantic moves, for which rep; movs is actually no
slower than even the trickiest use of ymm registers.
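
for illustration, the "simple memmove" mentioned in the quoted text
could look roughly like the sketch below.  this is not the go runtime's
code: moveWithin is a hypothetical name, and it works on offsets into
one slice rather than raw pointers to keep the example self-contained.
whether a given compiler keeps such a loop off the xmm registers is an
assumption that would need checking against the generated code.

```go
package main

// moveWithin copies n bytes inside buf from offset src to offset dst,
// one byte at a time, and is safe when the two ranges overlap.
// Hypothetical helper for illustration only; not runtime.memmove.
func moveWithin(buf []byte, dst, src, n int) {
	if n <= 0 || dst == src {
		return
	}
	if dst < src {
		// Destination is below the source: copy forward so each
		// source byte is read before it can be overwritten.
		for i := 0; i < n; i++ {
			buf[dst+i] = buf[src+i]
		}
		return
	}
	// Destination is above the source: copy backward for the same reason.
	for i := n - 1; i >= 0; i-- {
		buf[dst+i] = buf[src+i]
	}
}
```

the direction test is the only subtle part; a byte loop like this
trades speed for not needing any vector state, which is the point in a
note handler.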

- erik