[musl] ASM-to-C conversion for i386

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] ASM-to-C conversion for i386
@ 2021-12-26 20:42 Markus Wichmann
  2021-12-26 21:20 ` Markus Wichmann
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-26 20:42 UTC (permalink / raw)
  To: musl

Hi all,

merry Christmas, everyone. I hope you survived the various family
visitations in good health and are slowly coming out of the food coma,
or whatever your anual rituals are.

Anyway, I found myself with a bit of time on my hands and chose to be
productive for once. Rich made some noise however long ago that he
wanted to move from assembly source code files to C source code files
with inline assembly. So I looked at what I could contribute to that
cause.

This is hindered somewhat by the fact that my knowledge of assembler is
restricted to x86, PowerPC, and Microblaze. And for Microblaze, it has
been a while since I've used it.. For ARM and most of the others, I can
get the gist, but there may be subtleties I am not grasping, and that is
precisely what we cannot use for such a conversion.

So I decided to start with the architecture I am most familiar with:
i386. And now I am finished with the largest part of it, the maths code.
That is, finished with the first pass.

You can follow the progress here: https://github.com/nullplan/musl/tree/asm2c

So I've converted __set_thread_area(). That was pretty straightforward
once I found SYSCALL_NO_TLS. The generated assembly generated by clang
6.0.0 hits the same notes as the handwritten code, so I'm willing to
count that as a win.

For the maths code, I've added the likely() and unlikely() macros to
libm.h. Not sure if they belong there, but they do make the generated
assembly more similar to the handwritten code.

Most of that code was straightforward, but some of the more complex
functions I am not sure about. What is up with __exp2l()? I can see that
expl() is calling it, but I'm not sure why. But its existence forced me
to employ a technique not used elsewhere in the code (that I could
find): A hidden alias. I vaguely recall that such hackery was rejected
before (on grounds of old binutils reacting badly to such magic), but I
don't really know what else I could have done. Or was the correct way to
make __exp2l() a hidden function with the actual implementation and
exp2l() (without the underscores) a weak alias?

Anyway, the maths code suffers from massive code duplication on both
assembler and C levels. Not sure what to do about it, though. In many
cases, each of the three versions of a function only differ in the fine
details, but clang being as inline happy as it is means that many
techniques to reduce code duplication in C cause bloated object files in
assembler. For example, all functions of the floor, ceil, and trunc
families have been implemented in floor.c, in terms of a new static
function I called "rndint()", containing the heart of what used to be at
label 1 in floor.s. Unfortunately, after compiling, clang has inlined
rndint() every time, so that floor.o contains all nine functions, and
all functions are substantially copies of rndint(). The only solution I
would see to that would have been to rename "rndint()" to something with
a double underscore at the start, make it hidden and extern, and move
all the functions into their own files, thus preventing inlining and
making the object files more modular. Not sure how you'd like it.

Also, the generated assembly tends to use more memory. It appears that
clang is hesitant to overwrite memory allocated to a variable, even if
that variable is currently parked in a register. Or maybe my clang
version is just weird. That also explains why it sometimes emits "fld"
instructions in the wrong order and then fixes the mistake with "fxch".
Not a huge deal, just weird. Nothing forces the wrong order. And the
order is often correct in the smaller precision versions of the same
function.

Many of the maths functions are testing if their argument is subnormal,
and return an underflow exception if so and the argument is not zero.
For the single-precision case, the idiom used was to square the input,
which I have recreated with FORCE_EVAL(). For the double-precision case,
however, it was to store the variable as single precision.

Finally, I have also converted fenv.s today. I was hesitant to do that
at first, since a general C framework for fenv is under development, but
it has been quite a while since I've heard a peep from that project. In
any case, since their code should overwrite all of the existing fenv
code, a merge would now just lead to trivial path conflicts that are
easily resolved.

I believe in doing the conversion, I found a bug in feclearexcept(). The
original code said in the non-SSE version (context: EAX contains the
status word, ECX contains the function argument, and "1b" is a function
return)

|	test %eax,%ecx
|	jz 1b
|	not %ecx
|	and %ecx,%eax
|	test $0x3f,%eax
|	jz 1f
|	fnclex
|	jmp 1b
|1:	sub $32,%esp
|	fnstenv (%esp)
|	mov %al,4(%esp)
|	fldenv (%esp)
|	add $32,%esp
|	xor %eax,%eax
|	ret

That second "jz" confuses me. The intent seems to be to test if any
exceptions remain, and use "fnclex" if not. That would make sense, since
"fnclex" clears all exceptions. But since the second "jz" is a "jz" and
not a "jnz", the "fnclex" path is used only if exceptions remain, and
the slower "fldenv" path is used if none remain. Or am I reading this
wrong?

Anyway, I implemented the logic that made sense to me in the C version.

What remains to be done? Well, looking at the list of assembler files,
the only targets for a C conversion that remain (in i386) are the string
functions. After that, it is time to clean up and submit patches.

Speaking of, how would you like those? One patch for everything, one
patch per directory (i.e. one for thread, one for math, one for fenv,
one for string), or one per functions group (the three precisions of
each function), or one per function? I don't want to overwhelm you.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-26 20:42 [musl] ASM-to-C conversion for i386 Markus Wichmann
@ 2021-12-26 21:20 ` Markus Wichmann
  2021-12-27 13:08 ` Markus Wichmann
  2021-12-27 15:00 ` Rich Felker
  2 siblings, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-26 21:20 UTC (permalink / raw)
  To: musl

On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> Many of the maths functions are testing if their argument is subnormal,
> and return an underflow exception if so and the argument is not zero.
> For the single-precision case, the idiom used was to square the input,
> which I have recreated with FORCE_EVAL(). For the double-precision case,
> however, it was to store the variable as single precision.

Dang it, I forgot to ask the point this was leading towards: Is there
already an idiom in use in the maths code elsewhere to force an
underflow exception for subnormal nonzero inputs?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-26 20:42 [musl] ASM-to-C conversion for i386 Markus Wichmann
  2021-12-26 21:20 ` Markus Wichmann
@ 2021-12-27 13:08 ` Markus Wichmann
  2021-12-27 15:00 ` Rich Felker
  2 siblings, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-27 13:08 UTC (permalink / raw)
  To: musl

Hi all,

so, I've pushed the string functions as well. Looking at the remaining
assembler files, those all need to be assembler.

Only dlsym could possibly be written in C by making use of a GNU C
feature (__builtin_return_address()), but that has been proposed and
rejected before. All the others require explicit stack control, explicit
instruction control, have non-standard calling conventions, or are doing
something you else you cannot possibly do in C.

So, any comments about the files in my repo before I make patches? And
again, how would you like those patches? I think I will wait until next
year with the patches.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-26 20:42 [musl] ASM-to-C conversion for i386 Markus Wichmann
  2021-12-26 21:20 ` Markus Wichmann
  2021-12-27 13:08 ` Markus Wichmann
@ 2021-12-27 15:00 ` Rich Felker
  2021-12-27 16:27   ` Markus Wichmann
                     ` (2 more replies)
  2 siblings, 3 replies; 10+ messages in thread
From: Rich Felker @ 2021-12-27 15:00 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> Hi all,
> 
> merry Christmas, everyone. I hope you survived the various family
> visitations in good health and are slowly coming out of the food coma,
> or whatever your anual rituals are.
> 
> Anyway, I found myself with a bit of time on my hands and chose to be
> productive for once. Rich made some noise however long ago that he
> wanted to move from assembly source code files to C source code files
> with inline assembly. So I looked at what I could contribute to that
> cause.

Thank you. Alexander Monakov started this with some changes ending at
commit bc87299ce72a52f4debf9fc19d859abe34dbdf43, but I haven't really
looked at it since then, so it's nice to see interest again!

> So I've converted __set_thread_area(). That was pretty straightforward
> once I found SYSCALL_NO_TLS. The generated assembly generated by clang
> 6.0.0 hits the same notes as the handwritten code, so I'm willing to
> count that as a win.

Somewhere I had an old version of this which didn't fare so well, so
it will be interesting to compare.

> For the maths code,

One thing I found right away on the math code: You can't use "=t" for
output in a float or double variable unless the asm naturally
guarantees there's no excess precision (e.g. fmod where the operation
is exact and the precision is necessarily bounded by the input
precision). Otherwise you need "=t" to a temporary long double output
to then let the compiler convert down to float or double (e.g. at time
of return; I think we're still using explicit cast there too in case
some compilers get this wrong).

> I've added the likely() and unlikely() macros to
> libm.h. Not sure if they belong there, but they do make the generated
> assembly more similar to the handwritten code.

My leaning was always to leave out unless/until there's evidence it
matters, but nsz seems to have found they did and introduced them just
with names predict_true and predict_false which are already there.
This was probably a compromise between the "likely" notation I don't
like (reads as a predicate lik "if this is likely" rather than "if X,
which is likely") and my rather odd suggestion of as_expected (if
(as_expected(x))). :-P

> Most of that code was straightforward, but some of the more complex
> functions I am not sure about. What is up with __exp2l()? I can see that
> expl() is calling it, but I'm not sure why. But its existence forced me
> to employ a technique not used elsewhere in the code (that I could
> find): A hidden alias. I vaguely recall that such hackery was rejected
> before (on grounds of old binutils reacting badly to such magic), but I
> don't really know what else I could have done. Or was the correct way to
> make __exp2l() a hidden function with the actual implementation and
> exp2l() (without the underscores) a weak alias?

Here it doesn't really matter at all since they're both in the
baseline C namespace.

> Anyway, the maths code suffers from massive code duplication on both
> assembler and C levels. Not sure what to do about it, though. In many
> cases, each of the three versions of a function only differ in the fine
> details, but clang being as inline happy as it is means that many
> techniques to reduce code duplication in C cause bloated object files in
> assembler. For example, all functions of the floor, ceil, and trunc
> families have been implemented in floor.c, in terms of a new static
> function I called "rndint()", containing the heart of what used to be at
> label 1 in floor.s. Unfortunately, after compiling, clang has inlined
> rndint() every time, so that floor.o contains all nine functions, and
> all functions are substantially copies of rndint(). The only solution I
> would see to that would have been to rename "rndint()" to something with
> a double underscore at the start, make it hidden and extern, and move
> all the functions into their own files, thus preventing inlining and
> making the object files more modular. Not sure how you'd like it.

This was really just a "DRY" optimization at the source level, not an
attempt to save a net 8 insns per function, and a significant part of
the reason I want to get rid of the asm source files. A bug in the
copy-and-paste of that asm, or a bug that got fixed in all but one
copy of it, would have been a huge pain to deal with. So I think
having them all share an inline function is fine. Of course since
there's then no reason for them to be in the same source file, they
should probably be moved to their corresponding named files instead of
empty .c files.

> Also, the generated assembly tends to use more memory. It appears that
> clang is hesitant to overwrite memory allocated to a variable, even if
> that variable is currently parked in a register. Or maybe my clang
> version is just weird. That also explains why it sometimes emits "fld"
> instructions in the wrong order and then fixes the mistake with "fxch".
> Not a huge deal, just weird. Nothing forces the wrong order. And the
> order is often correct in the smaller precision versions of the same
> function.

I wouldn't really characterize that as using memory anyway; at most
it's using one additional cache line of memory already in use (and
likely already in cache from other calls anyway).

> Many of the maths functions are testing if their argument is subnormal,
> and return an underflow exception if so and the argument is not zero.
> For the single-precision case, the idiom used was to square the input,
> which I have recreated with FORCE_EVAL(). For the double-precision case,
> however, it was to store the variable as single precision.

I'm not sure; I might ask what nsz thinks is best on this, or compare
how it's done in C math sources (although here we can make a
different choice depending on what is best on x86).

> Finally, I have also converted fenv.s today. I was hesitant to do that
> at first, since a general C framework for fenv is under development, but
> it has been quite a while since I've heard a peep from that project. In
> any case, since their code should overwrite all of the existing fenv
> code, a merge would now just lead to trivial path conflicts that are
> easily resolved.
> 
> I believe in doing the conversion, I found a bug in feclearexcept(). The
> original code said in the non-SSE version (context: EAX contains the
> status word, ECX contains the function argument, and "1b" is a function
> return)
> 
> |	test %eax,%ecx
> |	jz 1b
> |	not %ecx
> |	and %ecx,%eax
> |	test $0x3f,%eax
> |	jz 1f
> |	fnclex
> |	jmp 1b
> |1:	sub $32,%esp
> |	fnstenv (%esp)
> |	mov %al,4(%esp)
> |	fldenv (%esp)
> |	add $32,%esp
> |	xor %eax,%eax
> |	ret
> 
> That second "jz" confuses me. The intent seems to be to test if any
> exceptions remain, and use "fnclex" if not. That would make sense, since
> "fnclex" clears all exceptions. But since the second "jz" is a "jz" and
> not a "jnz", the "fnclex" path is used only if exceptions remain, and
> the slower "fldenv" path is used if none remain. Or am I reading this
> wrong?
> 
> Anyway, I implemented the logic that made sense to me in the C version.

I haven't looked at this in a long time and will have to look in more
detail later.

> What remains to be done? Well, looking at the list of assembler files,
> the only targets for a C conversion that remain (in i386) are the string
> functions. After that, it is time to clean up and submit patches.
> 
> Speaking of, how would you like those? One patch for everything, one
> patch per directory (i.e. one for thread, one for math, one for fenv,
> one for string), or one per functions group (the three precisions of
> each function), or one per function? I don't want to overwhelm you.

Probably split up somewhat so that things that would need discussion
separately or that could be conceptually right/wrong independent of
one another (not just minor bugs) are separate commits. After looking
at the commits in your repo more I can probably make a better
recommendation.

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 15:00 ` Rich Felker
@ 2021-12-27 16:27   ` Markus Wichmann
  2021-12-27 16:30   ` Rich Felker
  2021-12-29 10:02   ` Markus Wichmann
  2 siblings, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-27 16:27 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Mon, Dec 27, 2021 at 10:00:11AM -0500, Rich Felker wrote:
> On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> > For the maths code,
>
> One thing I found right away on the math code: You can't use "=t" for
> output in a float or double variable unless the asm naturally
> guarantees there's no excess precision (e.g. fmod where the operation
> is exact and the precision is necessarily bounded by the input
> precision). Otherwise you need "=t" to a temporary long double output
> to then let the compiler convert down to float or double (e.g. at time
> of return; I think we're still using explicit cast there too in case
> some compilers get this wrong).
>

Oh yes, I noticed the missing conversion instructions in some places and
didn't in others. As I said, some cleanup is in order. The branch is not
currently ripe for merging (yet). The code was written over several days
in various stages of alertness and distraction.

> This was probably a compromise between the "likely" notation I don't
> like (reads as a predicate lik "if this is likely" rather than "if X,
> which is likely") and my rather odd suggestion of as_expected (if
> (as_expected(x))). :-P
>

I agree to the point, although likely()/unlikely() seems to be the
convention everywhere else. OK, so predict_X it is. Which is still in
libm.h, so I cannot use it anywhere else in the code.

> > Most of that code was straightforward, but some of the more complex
> > functions I am not sure about. What is up with __exp2l()? I can see that
> > expl() is calling it, but I'm not sure why. But its existence forced me
> > to employ a technique not used elsewhere in the code (that I could
> > find): A hidden alias. I vaguely recall that such hackery was rejected
> > before (on grounds of old binutils reacting badly to such magic), but I
> > don't really know what else I could have done. Or was the correct way to
> > make __exp2l() a hidden function with the actual implementation and
> > exp2l() (without the underscores) a weak alias?
>
> Here it doesn't really matter at all since they're both in the
> baseline C namespace.
>

I found later that you introduced the hidden double-underscore versions
of some symbols that are called directly from assembler elsewhere to get
rid of things that would be textrels if not for a linker option. So that
entire point is moot with the move to C, anyway. I'll remove those in
the math code.

In the string code, I have currently defined __memcpy_fwd this way. Not
sure what to do there, yet, and the way I'm using the definition in
memmove() might invoke undefined behavior, because of
declaration/definition mismatch. Or do I define __memcpy_fwd without
"restrict"?

> So I think
> having them all share an inline function is fine. Of course since
> there's then no reason for them to be in the same source file, they
> should probably be moved to their corresponding named files instead of
> empty .c files.
>

So, you want me to move the core function to a header file as static
inline? That is possible and avoids the source duplication problem. I
was also wondering about those functions where the smaller-precision
versions just do what the long-double version is doing with maybe some
pre-/post-processing. Should they receive the same treatment or should
they just call the long-double version? E.g. acos, asin, log2, &c. Not
sure if the savings from not having to do a proper call/return are worth
the hassle of a new file with some of these.

> > Speaking of, how would you like those? One patch for everything, one
> > patch per directory (i.e. one for thread, one for math, one for fenv,
> > one for string), or one per functions group (the three precisions of
> > each function), or one per function? I don't want to overwhelm you.
>
> Probably split up somewhat so that things that would need discussion
> separately or that could be conceptually right/wrong independent of
> one another (not just minor bugs) are separate commits. After looking
> at the commits in your repo more I can probably make a better
> recommendation.
>

OK, I'll use my judgment when the time comes.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 15:00 ` Rich Felker
  2021-12-27 16:27   ` Markus Wichmann
@ 2021-12-27 16:30   ` Rich Felker
  2021-12-27 18:04     ` Markus Wichmann
  2021-12-29 10:02   ` Markus Wichmann
  2 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2021-12-27 16:30 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Mon, Dec 27, 2021 at 10:00:11AM -0500, Rich Felker wrote:
> On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> > For the maths code,
> 
> One thing I found right away on the math code: You can't use "=t" for
> output in a float or double variable unless the asm naturally
> guarantees there's no excess precision (e.g. fmod where the operation
> is exact and the precision is necessarily bounded by the input
> precision). Otherwise you need "=t" to a temporary long double output
> to then let the compiler convert down to float or double (e.g. at time
> of return; I think we're still using explicit cast there too in case
> some compilers get this wrong).

One thought, and I'm not sure if this is a good idea or a bad one but
worth discussing:

Using your acos.c as an example, where you have the comment:

	atan2(fabs(sqrt((1-x)*(1+x))), x)

The actual code could be written as:

	return (double)x87_fpatan(x, x87_fabs(x87_fsqrt((1-x)*(1+x))));

with the appropriate "x87.h" defining each of these with the
appropriate asm & constraints. This kinda makes the individual
functions self-documenting and non-error-prone (repetition of
error-prone constraints, especially the hidden requirement that, in
"=t"(x), x have type long double).

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 16:30   ` Rich Felker
@ 2021-12-27 18:04     ` Markus Wichmann
  2021-12-27 18:41       ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Markus Wichmann @ 2021-12-27 18:04 UTC (permalink / raw)
  To: musl

On Mon, Dec 27, 2021 at 11:30:56AM -0500, Rich Felker wrote:
> One thought, and I'm not sure if this is a good idea or a bad one but
> worth discussing:
>
> Using your acos.c as an example, where you have the comment:
>
> 	atan2(fabs(sqrt((1-x)*(1+x))), x)
>

That comment was copied from acos.s. In general, I have tried to
preserve comments. Except in fenv.s, where each time __hwcap was tested,
the same comment was prefixed, and its point should be coming across ten
times more easily by just creating a symbolic constant.

> The actual code could be written as:
>
> 	return (double)x87_fpatan(x, x87_fabs(x87_fsqrt((1-x)*(1+x))));
>
> with the appropriate "x87.h" defining each of these with the
> appropriate asm & constraints. This kinda makes the individual
> functions self-documenting and non-error-prone (repetition of
> error-prone constraints, especially the hidden requirement that, in
> "=t"(x), x have type long double).
>

That's probably an even better idea than what I am currently doing:
Moving the "core" functions into a new header file (as static inline
functions), and using these in the function implementations. I could not
get all the duplication out; in some cases the duplication is only
conceptual (hypot() and hypotf() have the same idea, but it needs to be
implemented differently due to the different precisions/representations).

I think I can combine both approaches, because what I'm doing appears to
have the effect of moving the __asm__ statements entirely out of the C
files into the new header file. And it appears that we are only using a
couple of instructions, anyway.

Downside is that implementing

static inline long double x87_fabs(long double x) {
    __asm__("fabs" : "+t"(x));
    return x;
}

now actually carries the connotation that the result is of
double-extended precision and needs rounding before being returned.
Unlike the current version which does not do that. However, to my
knowledge that will not actually be wrong, only slower, so a solution
that preserves the current connotations for these few instructions can
probably be considered a micro-optimization.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 18:04     ` Markus Wichmann
@ 2021-12-27 18:41       ` Rich Felker
  2021-12-28  8:42         ` Markus Wichmann
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2021-12-27 18:41 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Mon, Dec 27, 2021 at 07:04:37PM +0100, Markus Wichmann wrote:
> On Mon, Dec 27, 2021 at 11:30:56AM -0500, Rich Felker wrote:
> > One thought, and I'm not sure if this is a good idea or a bad one but
> > worth discussing:
> >
> > Using your acos.c as an example, where you have the comment:
> >
> > 	atan2(fabs(sqrt((1-x)*(1+x))), x)
> >
> 
> That comment was copied from acos.s. In general, I have tried to
> preserve comments. Except in fenv.s, where each time __hwcap was tested,
> the same comment was prefixed, and its point should be coming across ten
> times more easily by just creating a symbolic constant.
> 
> > The actual code could be written as:
> >
> > 	return (double)x87_fpatan(x, x87_fabs(x87_fsqrt((1-x)*(1+x))));
> >
> > with the appropriate "x87.h" defining each of these with the
> > appropriate asm & constraints. This kinda makes the individual
> > functions self-documenting and non-error-prone (repetition of
> > error-prone constraints, especially the hidden requirement that, in
> > "=t"(x), x have type long double).
> >
> 
> That's probably an even better idea than what I am currently doing:
> Moving the "core" functions into a new header file (as static inline
> functions), and using these in the function implementations. I could not
> get all the duplication out; in some cases the duplication is only
> conceptual (hypot() and hypotf() have the same idea, but it needs to be
> implemented differently due to the different precisions/representations).
> 
> I think I can combine both approaches, because what I'm doing appears to
> have the effect of moving the __asm__ statements entirely out of the C
> files into the new header file. And it appears that we are only using a
> couple of instructions, anyway.
> 
> Downside is that implementing
> 
> static inline long double x87_fabs(long double x) {
>     __asm__("fabs" : "+t"(x));
>     return x;
> }
> 
> now actually carries the connotation that the result is of
> double-extended precision and needs rounding before being returned.
> Unlike the current version which does not do that. However, to my
> knowledge that will not actually be wrong, only slower, so a solution
> that preserves the current connotations for these few instructions can
> probably be considered a micro-optimization.

Yes, I think the insns that can emit other precisions probably would
need 3 versions, but there are very few of these -- just fabs and
fmod?

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 18:41       ` Rich Felker
@ 2021-12-28  8:42         ` Markus Wichmann
  0 siblings, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-28  8:42 UTC (permalink / raw)
  To: musl; +Cc: musl

On Mon, Dec 27, 2021 at 01:41:38PM -0500, Rich Felker wrote:
> Yes, I think the insns that can emit other precisions probably would
> need 3 versions, but there are very few of these -- just fabs and
> fmod?
>
> Rich

x87 does not have an fmod instruction. There are fprem and fprem1, and
the difference escapes me at the moment. Anyway, frndint is also used in
this way. But for now, I have decided to just leave all of them as they
are. The fabs functions could not be any simpler, and it is unlikely the
remainder functions would benefit from this treatment, either, because
of the use of the status word. In the end I would create functions that
are used precisely once, and don't even clean up the function where they
are used all that much.

This means only the remquo functions suffer the penalty of superfluous
rounding. But in that case, I think the benefit of having only one place
with the implementation of the status word bit shuffle is actually worth
it. So the only thing I would even consider right now is to go back into
the remainder functions and replace the "0x400" with a symbolic
constant, now that there is one (X87_SW_C2).

Beyond that, I cannot think of what to change about the maths code any
more. String code needs more love, however.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] ASM-to-C conversion for i386
  2021-12-27 15:00 ` Rich Felker
  2021-12-27 16:27   ` Markus Wichmann
  2021-12-27 16:30   ` Rich Felker
@ 2021-12-29 10:02   ` Markus Wichmann
  2 siblings, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2021-12-29 10:02 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Mon, Dec 27, 2021 at 10:00:11AM -0500, Rich Felker wrote:
> On Sun, Dec 26, 2021 at 09:42:38PM +0100, Markus Wichmann wrote:
> > So I've converted __set_thread_area(). That was pretty straightforward
> > once I found SYSCALL_NO_TLS. The generated assembly generated by clang
> > 6.0.0 hits the same notes as the handwritten code, so I'm willing to
> > count that as a win.
>
> Somewhere I had an old version of this which didn't fare so well, so
> it will be interesting to compare.
>

Not quite sure what you mean by "didn't fare so well". I just fixed a
bug, and looked at the disassembly. Here's clang's output
(v6.0.0-ubuntu):


0005a482 <__set_thread_area>:
   5a482:	53                   	push   %ebx
   5a483:	56                   	push   %esi
   5a484:	83 ec 10             	sub    $0x10,%esp
   5a487:	e8 00 00 00 00       	call   5a48c <__set_thread_area+0xa>
   5a48c:	59                   	pop    %ecx
   5a48d:	81 c1 74 2b 04 00    	add    $0x42b74,%ecx
   5a493:	8b 44 24 1c          	mov    0x1c(%esp),%eax
   5a497:	8b b1 74 02 00 00    	mov    0x274(%ecx),%esi
   5a49d:	89 e2                	mov    %esp,%edx
   5a49f:	89 32                	mov    %esi,(%edx)
   5a4a1:	89 42 04             	mov    %eax,0x4(%edx)
   5a4a4:	c7 42 08 ff ff 0f 00 	movl   $0xfffff,0x8(%edx)
   5a4ab:	c7 42 0c 51 00 00 00 	movl   $0x51,0xc(%edx)
   5a4b2:	b8 f3 00 00 00       	mov    $0xf3,%eax
   5a4b7:	87 da                	xchg   %ebx,%edx
   5a4b9:	cd 80                	int    $0x80
   5a4bb:	87 da                	xchg   %ebx,%edx
   5a4bd:	85 c0                	test   %eax,%eax
   5a4bf:	74 22                	je     5a4e3 <__set_thread_area+0x61>
   5a4c1:	31 f6                	xor    %esi,%esi
   5a4c3:	46                   	inc    %esi
   5a4c4:	89 e1                	mov    %esp,%ecx
   5a4c6:	b8 7b 00 00 00       	mov    $0x7b,%eax
   5a4cb:	ba 10 00 00 00       	mov    $0x10,%edx
   5a4d0:	89 f3                	mov    %esi,%ebx
   5a4d2:	cd 80                	int    $0x80
   5a4d4:	b9 07 00 00 00       	mov    $0x7,%ecx
   5a4d9:	85 c0                	test   %eax,%eax
   5a4db:	74 18                	je     5a4f5 <__set_thread_area+0x73>
   5a4dd:	89 c6                	mov    %eax,%esi
   5a4df:	79 14                	jns    5a4f5 <__set_thread_area+0x73>
   5a4e1:	eb 14                	jmp    5a4f7 <__set_thread_area+0x75>
   5a4e3:	8b 04 24             	mov    (%esp),%eax
   5a4e6:	89 81 74 02 00 00    	mov    %eax,0x274(%ecx)
   5a4ec:	8d 0c c5 03 00 00 00 	lea    0x3(,%eax,8),%ecx
   5a4f3:	31 f6                	xor    %esi,%esi
   5a4f5:	8e e9                	mov    %ecx,%gs
   5a4f7:	89 f0                	mov    %esi,%eax
   5a4f9:	83 c4 10             	add    $0x10,%esp
   5a4fc:	5e                   	pop    %esi
   5a4fd:	5b                   	pop    %ebx
   5a4fe:	c3                   	ret

And here's gcc's (version 7.5.0-3ubuntu1):

00054908 <__set_thread_area>:
   54908:	56                   	push   %esi
   54909:	53                   	push   %ebx
   5490a:	e8 b0 3a fc ff       	call   183bf <__x86.get_pc_thunk.bx>
   5490f:	81 c3 f1 16 04 00    	add    $0x416f1,%ebx
   54915:	83 ec 10             	sub    $0x10,%esp
   54918:	8b 83 fc 02 00 00    	mov    0x2fc(%ebx),%eax
   5491e:	89 e1                	mov    %esp,%ecx
   54920:	c7 44 24 08 ff ff 0f 	movl   $0xfffff,0x8(%esp)
   54927:	00
   54928:	c7 44 24 0c 51 00 00 	movl   $0x51,0xc(%esp)
   5492f:	00
   54930:	89 ca                	mov    %ecx,%edx
   54932:	89 04 24             	mov    %eax,(%esp)
   54935:	8b 44 24 1c          	mov    0x1c(%esp),%eax
   54939:	89 44 24 04          	mov    %eax,0x4(%esp)
   5493d:	b8 f3 00 00 00       	mov    $0xf3,%eax
   54942:	87 da                	xchg   %ebx,%edx
   54944:	cd 80                	int    $0x80
   54946:	87 da                	xchg   %ebx,%edx
   54948:	85 c0                	test   %eax,%eax
   5494a:	75 14                	jne    54960 <__set_thread_area+0x58>
   5494c:	8b 04 24             	mov    (%esp),%eax
   5494f:	89 83 fc 02 00 00    	mov    %eax,0x2fc(%ebx)
   54955:	8d 34 c5 03 00 00 00 	lea    0x3(,%eax,8),%esi
   5495c:	31 c0                	xor    %eax,%eax
   5495e:	eb 23                	jmp    54983 <__set_thread_area+0x7b>
   54960:	b8 7b 00 00 00       	mov    $0x7b,%eax
   54965:	bb 01 00 00 00       	mov    $0x1,%ebx
   5496a:	ba 10 00 00 00       	mov    $0x10,%edx
   5496f:	cd 80                	int    $0x80
   54971:	85 c0                	test   %eax,%eax
   54973:	74 04                	je     54979 <__set_thread_area+0x71>
   54975:	78 0e                	js     54985 <__set_thread_area+0x7d>
   54977:	eb 0a                	jmp    54983 <__set_thread_area+0x7b>
   54979:	be 07 00 00 00       	mov    $0x7,%esi
   5497e:	b8 01 00 00 00       	mov    $0x1,%eax
   54983:	8e ee                	mov    %esi,%gs
   54985:	83 c4 10             	add    $0x10,%esp
   54988:	5b                   	pop    %ebx
   54989:	5e                   	pop    %esi
   5498a:	c3                   	ret

So no, it's not the same as the hand-written assembler (but then, I
don't think there's a way to make a C compiler emit that code), but it
does hit the right notes. The system calls are called with the right
values, and the correct thing is moved into GS in the end. And isn't
that what counts?

Not quite sure why that is a 32-bit move, when I explicitly requested
the operand be a 16-bit value, but in this case it does not matter, as
the high sixteen bits are discarded in that instruction.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-12-29 20:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-26 20:42 [musl] ASM-to-C conversion for i386 Markus Wichmann
2021-12-26 21:20 ` Markus Wichmann
2021-12-27 13:08 ` Markus Wichmann
2021-12-27 15:00 ` Rich Felker
2021-12-27 16:27   ` Markus Wichmann
2021-12-27 16:30   ` Rich Felker
2021-12-27 18:04     ` Markus Wichmann
2021-12-27 18:41       ` Rich Felker
2021-12-28  8:42         ` Markus Wichmann
2021-12-29 10:02   ` Markus Wichmann

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).