From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham
	autolearn_force=no version=3.4.4
Received: (qmail 29085 invoked from network); 27 Dec 2021 15:00:29 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 27 Dec 2021 15:00:29 -0000
Received: (qmail 15624 invoked by uid 550); 27 Dec 2021 15:00:25 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 15585 invoked from network); 27 Dec 2021 15:00:24 -0000
Date: Mon, 27 Dec 2021 10:00:11 -0500
From: Rich Felker <dalias@libc.org>
To: Markus Wichmann <nullplan@gmx.net>
Cc: musl@lists.openwall.com
Message-ID: <20211227150010.GU7074@brightrain.aerifal.cx>
References: <20211226204238.GA1949@voyager>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20211226204238.GA1949@voyager>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] ASM-to-C conversion for i386

On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> Hi all,
> 
> merry Christmas, everyone. I hope you survived the various family
> visitations in good health and are slowly coming out of the food coma,
> or whatever your anual rituals are.
> 
> Anyway, I found myself with a bit of time on my hands and chose to be
> productive for once. Rich made some noise however long ago that he
> wanted to move from assembly source code files to C source code files
> with inline assembly. So I looked at what I could contribute to that
> cause.

Thank you. Alexander Monakov started this with some changes ending at
commit bc87299ce72a52f4debf9fc19d859abe34dbdf43, but I haven't really
looked at it since then, so it's nice to see interest again!

> So I've converted __set_thread_area(). That was pretty straightforward
> once I found SYSCALL_NO_TLS. The generated assembly generated by clang
> 6.0.0 hits the same notes as the handwritten code, so I'm willing to
> count that as a win.

Somewhere I had an old version of this which didn't fare so well, so
it will be interesting to compare.

> For the maths code,

One thing I found right away on the math code: You can't use "=t" for
output in a float or double variable unless the asm naturally
guarantees there's no excess precision (e.g. fmod where the operation
is exact and the precision is necessarily bounded by the input
precision). Otherwise you need "=t" to a temporary long double output
to then let the compiler convert down to float or double (e.g. at time
of return; I think we're still using explicit cast there too in case
some compilers get this wrong).

> I've added the likely() and unlikely() macros to
> libm.h. Not sure if they belong there, but they do make the generated
> assembly more similar to the handwritten code.

My leaning was always to leave out unless/until there's evidence it
matters, but nsz seems to have found they did and introduced them just
with names predict_true and predict_false which are already there.
This was probably a compromise between the "likely" notation I don't
like (reads as a predicate lik "if this is likely" rather than "if X,
which is likely") and my rather odd suggestion of as_expected (if
(as_expected(x))). :-P

> Most of that code was straightforward, but some of the more complex
> functions I am not sure about. What is up with __exp2l()? I can see that
> expl() is calling it, but I'm not sure why. But its existence forced me
> to employ a technique not used elsewhere in the code (that I could
> find): A hidden alias. I vaguely recall that such hackery was rejected
> before (on grounds of old binutils reacting badly to such magic), but I
> don't really know what else I could have done. Or was the correct way to
> make __exp2l() a hidden function with the actual implementation and
> exp2l() (without the underscores) a weak alias?

Here it doesn't really matter at all since they're both in the
baseline C namespace.

> Anyway, the maths code suffers from massive code duplication on both
> assembler and C levels. Not sure what to do about it, though. In many
> cases, each of the three versions of a function only differ in the fine
> details, but clang being as inline happy as it is means that many
> techniques to reduce code duplication in C cause bloated object files in
> assembler. For example, all functions of the floor, ceil, and trunc
> families have been implemented in floor.c, in terms of a new static
> function I called "rndint()", containing the heart of what used to be at
> label 1 in floor.s. Unfortunately, after compiling, clang has inlined
> rndint() every time, so that floor.o contains all nine functions, and
> all functions are substantially copies of rndint(). The only solution I
> would see to that would have been to rename "rndint()" to something with
> a double underscore at the start, make it hidden and extern, and move
> all the functions into their own files, thus preventing inlining and
> making the object files more modular. Not sure how you'd like it.

This was really just a "DRY" optimization at the source level, not an
attempt to save a net 8 insns per function, and a significant part of
the reason I want to get rid of the asm source files. A bug in the
copy-and-paste of that asm, or a bug that got fixed in all but one
copy of it, would have been a huge pain to deal with. So I think
having them all share an inline function is fine. Of course since
there's then no reason for them to be in the same source file, they
should probably be moved to their corresponding named files instead of
empty .c files.

> Also, the generated assembly tends to use more memory. It appears that
> clang is hesitant to overwrite memory allocated to a variable, even if
> that variable is currently parked in a register. Or maybe my clang
> version is just weird. That also explains why it sometimes emits "fld"
> instructions in the wrong order and then fixes the mistake with "fxch".
> Not a huge deal, just weird. Nothing forces the wrong order. And the
> order is often correct in the smaller precision versions of the same
> function.

I wouldn't really characterize that as using memory anyway; at most
it's using one additional cache line of memory already in use (and
likely already in cache from other calls anyway).

> Many of the maths functions are testing if their argument is subnormal,
> and return an underflow exception if so and the argument is not zero.
> For the single-precision case, the idiom used was to square the input,
> which I have recreated with FORCE_EVAL(). For the double-precision case,
> however, it was to store the variable as single precision.

I'm not sure; I might ask what nsz thinks is best on this, or compare
how it's done in C math sources (although here we can make a
different choice depending on what is best on x86).

> Finally, I have also converted fenv.s today. I was hesitant to do that
> at first, since a general C framework for fenv is under development, but
> it has been quite a while since I've heard a peep from that project. In
> any case, since their code should overwrite all of the existing fenv
> code, a merge would now just lead to trivial path conflicts that are
> easily resolved.
> 
> I believe in doing the conversion, I found a bug in feclearexcept(). The
> original code said in the non-SSE version (context: EAX contains the
> status word, ECX contains the function argument, and "1b" is a function
> return)
> 
> |	test %eax,%ecx
> |	jz 1b
> |	not %ecx
> |	and %ecx,%eax
> |	test $0x3f,%eax
> |	jz 1f
> |	fnclex
> |	jmp 1b
> |1:	sub $32,%esp
> |	fnstenv (%esp)
> |	mov %al,4(%esp)
> |	fldenv (%esp)
> |	add $32,%esp
> |	xor %eax,%eax
> |	ret
> 
> That second "jz" confuses me. The intent seems to be to test if any
> exceptions remain, and use "fnclex" if not. That would make sense, since
> "fnclex" clears all exceptions. But since the second "jz" is a "jz" and
> not a "jnz", the "fnclex" path is used only if exceptions remain, and
> the slower "fldenv" path is used if none remain. Or am I reading this
> wrong?
> 
> Anyway, I implemented the logic that made sense to me in the C version.

I haven't looked at this in a long time and will have to look in more
detail later.

> What remains to be done? Well, looking at the list of assembler files,
> the only targets for a C conversion that remain (in i386) are the string
> functions. After that, it is time to clean up and submit patches.
> 
> Speaking of, how would you like those? One patch for everything, one
> patch per directory (i.e. one for thread, one for math, one for fenv,
> one for string), or one per functions group (the three precisions of
> each function), or one per function? I don't want to overwhelm you.

Probably split up somewhat so that things that would need discussion
separately or that could be conceptually right/wrong independent of
one another (not just minor bugs) are separate commits. After looking
at the commits in your repo more I can probably make a better
recommendation.

Rich