From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom
Date: Tue, 23 Feb 2016 07:27:42 -0800
To: 9fans@9fans.net
Message-ID: <45fced5cac8155366565f0195c0b47b9@lilly.quanstro.net>
In-Reply-To: <71A3F6B7-CC61-468D-B8B2-3D46AB92483D@gmail.com>
References: <71A3F6B7-CC61-468D-B8B2-3D46AB92483D@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] Go: FP in note handler
Topicbox-Message-UUID: 88ee7e64-ead9-11e9-9d60-3106f5b1d025

On Tue Feb 23 02:36:41 PST 2016, kennylevinsen@gmail.com wrote:

> Ah, no - it is not a system-wide adjustment, but adjustment of the plan9 specific runtime.sighandler implementation and everything called by it directly. Notes that don't exit the process are queued and should run outside the actual note handler.
>
> I think the "magic" code will be isolated, and might fend off accidental future additions of floating point registers. The magic-ness also only revolves around avoiding duffzero and duffcopy in some way. I also think that removing conditionals in the compiler will be a positive thing.
>
> I still do not know the feasibility of my plan, whether it is possible to do cleanly, or possible at all. Maybe someone smarter than me with knowledge on the matter could chime in and call me an idiot?
>
> Avoiding duffcopy should be easy with a simple memmove implementation. If done right, we can also remove the plan9 specific runtime.memmove and only use the slow memmove in sighandler (The global runtime.memmove is implemented using MOVUPS just like duffcopy. Duffcopy is used for blockcopies by the compiler in some cases, although I must admit to not knowing all the cases yet).
>
> Avoiding duffzero without compiler assistance is a bit more tricky - global variables, stack on assembly functions, something like that.
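for illustration only, the "simple memmove" mentioned in the quoted text could look roughly like the sketch below. slowMove is a hypothetical name, not anything in the go runtime, and the real thing would sit inside the runtime (quite possibly in assembly) rather than in user code. it copies one byte at a time and picks the direction so that overlapping ranges work. whether a given compiler keeps a loop like this free of xmm instructions is an assumption that would have to be checked against the generated assembly.

```go
package main

import (
	"fmt"
	"unsafe"
)

// slowMove is a hypothetical byte-at-a-time memmove: it copies
// min(len(dst), len(src)) bytes from src to dst and returns the count.
// Overlapping slices are handled by choosing the copy direction.
func slowMove(dst, src []byte) int {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return 0
	}
	// Compare the start addresses. If dst starts at or before src,
	// a forward copy never overwrites bytes that are still to be read.
	// Otherwise copy backward, which is safe when dst starts after src.
	if uintptr(unsafe.Pointer(&dst[0])) <= uintptr(unsafe.Pointer(&src[0])) {
		for i := 0; i < n; i++ {
			dst[i] = src[i]
		}
	} else {
		for i := n - 1; i >= 0; i-- {
			dst[i] = src[i]
		}
	}
	return n
}

func main() {
	buf := []byte("abcdefgh")
	// Overlapping move: shift the first six bytes two places to the right.
	n := slowMove(buf[2:], buf[:6])
	fmt.Println(n, string(buf))
}
```

the backward loop is what makes this a memmove rather than a memcpy: shifting data toward higher addresses within the same buffer would otherwise clobber bytes before they are read.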
fwiw, on modern amd64 machines, using the xmm and ymm registers has a benefit
only in a narrow range of sizes (384-511 bytes) and a subset of
(mis-)alignments that i've forgotten. at least for the exact test setup i
used on 3-4 different µarches. intel claims rep; movs is the
(architecturally) fastest way to go.

i am not sure any of this makes much difference, as it's hard to know what
a real-world memory access pattern looks like, and that seems to dominate
all but gigantic moves, for which rep; movs is actually no slower than even
the trickiest use of ymm registers.

- erik