caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Need for a built in round_to_int function
@ 2005-02-20 20:22 Erik de Castro Lopo
  2005-02-21  0:00 ` [Caml-list] " Robert Roessler
  2005-02-21 11:54 ` Erik de Castro Lopo
  0 siblings, 2 replies; 9+ messages in thread
From: Erik de Castro Lopo @ 2005-02-20 20:22 UTC (permalink / raw)
  To: caml-list

Hi all,

I am about to port some code from C to O'caml. This code uses the 
C99 function :

    long int lrint (double d) ;

which performs rounding on the double and then converts that to
a long int.

In O'caml the only option seems to be:

    let round_to_int f = int_of_float (f +. 0.5) ;;

The problem is that this code on i386 produces really slow code:

    804b385:    dd 44 98 fc        fldl   0xfffffffc(%eax,%ebx,4)
    804b389:    de c1              faddp  %st,%st(1)
    804b38b:    83 ec 08           sub    $0x8,%esp
    804b38e:    d9 7c 24 04        fnstcw 0x4(%esp)
    804b392:    66 8b 44 24 04     mov    0x4(%esp),%ax
    804b397:    b4 0c              mov    $0xc,%ah
    804b399:    66 89 44 24 00     mov    %ax,0x0(%esp)
    804b39e:    d9 6c 24 00        fldcw  0x0(%esp)
    804b3a2:    db 1c 24           fistpl (%esp)
    804b3a5:    8b 04 24           mov    (%esp),%eax
    804b3a8:    d9 6c 24 04        fldcw  0x4(%esp)
    804b3ac:    83 c4 08           add    $0x8,%esp

The killer here is the two fldcw (floating point load control word)
instructions, around the fistpl (which actually does the float to int 
conversion). Loading the FP control work causes a flush of the FPU
pipeline. In code with a lot of floating point code interspersed
with a round to int, there can be a significant slow down due to
the fldcw instructions.

The lrint function in C, replaces all the above with one fistpl
and a single mov instruction and leaves the floating point
control word intact. In C code that moved from:

    (int) floor (f + 0.5)

to
    lrintf (f)

I have seen an up to 4 fold increase in speed.

I've looked at the code for the O'Caml compiler and I think I 
know how to implement this, at least for x86 and PowerPC, the two
architectures I have access to. If I was to supply a patch would
it be accepted?


I know other suggestions like this one :

    http://sardes.inrialpes.fr/~aschmitt/cwn/2003.11.18.html#1

were not viewed favourably, but the addition of a single function
with an explicit behaviour is a far neater solution.

Regards,
Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"There are two kinds of large software systems: those that evolved
from small systems and those that don't work."
-- Seen on slashdot.org, then quoted by amk


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-20 20:22 Need for a built in round_to_int function Erik de Castro Lopo
@ 2005-02-21  0:00 ` Robert Roessler
  2005-02-21  1:51   ` Erik de Castro Lopo
  2005-02-21 11:54 ` Erik de Castro Lopo
  1 sibling, 1 reply; 9+ messages in thread
From: Robert Roessler @ 2005-02-21  0:00 UTC (permalink / raw)
  To: Erik de Castro Lopo; +Cc: caml-list

Erik de Castro Lopo wrote:

> I am about to port some code from C to O'caml. This code uses the 
> C99 function :
> 
>     long int lrint (double d) ;
> 
> which performs rounding on the double and then converts that to
> a long int.
> 
> In O'caml the only option seems to be:
> 
>     let round_to_int f = int_of_float (f +. 0.5) ;;
> 
> The problem is that this code on i386 produces really slow code:
> 
>     804b385:    dd 44 98 fc        fldl   0xfffffffc(%eax,%ebx,4)
>     804b389:    de c1              faddp  %st,%st(1)
>     804b38b:    83 ec 08           sub    $0x8,%esp
>     804b38e:    d9 7c 24 04        fnstcw 0x4(%esp)
>     804b392:    66 8b 44 24 04     mov    0x4(%esp),%ax
>     804b397:    b4 0c              mov    $0xc,%ah
>     804b399:    66 89 44 24 00     mov    %ax,0x0(%esp)
>     804b39e:    d9 6c 24 00        fldcw  0x0(%esp)
>     804b3a2:    db 1c 24           fistpl (%esp)
>     804b3a5:    8b 04 24           mov    (%esp),%eax
>     804b3a8:    d9 6c 24 04        fldcw  0x4(%esp)
>     804b3ac:    83 c4 08           add    $0x8,%esp
> 
> The killer here is the two fldcw (floating point load control word)
> instructions, around the fistpl (which actually does the float to int 
> conversion). Loading the FP control work causes a flush of the FPU
> pipeline. In code with a lot of floating point code interspersed
> with a round to int, there can be a significant slow down due to
> the fldcw instructions.

I will preface this by a Slashdot-like "IANANA" (I Am Not A Numerical 
Analyst).

The above approach is more or less what you expect if you (as a 
compiler code generator) a) want to do rounding following C/C++ 
standards ("Truncate (toward 0)"), and b) make no assumption regarding 
the state of the IEEE hardware rounding setting...

> The lrint function in C, replaces all the above with one fistpl
> and a single mov instruction and leaves the floating point
> control word intact. In C code that moved from:
> 
>     (int) floor (f + 0.5)
> 
> to
>     lrintf (f)
> 
> I have seen an up to 4 fold increase in speed.

You, on the other hand, are willing to make an assumption regarding 
the hardware rounding mode - [presumably] that it is set to the 
power-on default of "Round to nearest, or to even if equidistant", 
which may not be unreasonable - it just needs to be explicit that this 
*is* the assumption, and that you have a way of verifying (or at least 
reason to believe) that other software components in your app's 
environment are not invalidating this assumption.

The fact that the default hardware rounding mode does NOT match "(int) 
floor (f + 0.5)" should also be mentioned... the "+ 0.5" attempts to 
do what the hardware would call "Round up (toward +infinity)" while 
the "floor" would match the "Round down (toward -infinity)" mode. 
Combining them does not equate to "Round to nearest, or to even if 
equidistant". :)

In case it isn't obvious, the IEEE hardware default rounding behavior 
is chosen to minimize the effects of accumulated rounding errors in a 
series of calculations involving rounding.

> I've looked at the code for the O'Caml compiler and I think I 
> know how to implement this, at least for x86 and PowerPC, the two
> architectures I have access to. If I was to supply a patch would
> it be accepted?
> 
> 
> I know other suggestions like this one :
> 
>     http://sardes.inrialpes.fr/~aschmitt/cwn/2003.11.18.html#1
> 
> were not viewed favourably, but the addition of a single function
> with an explicit behaviour is a far neater solution.

This could take the form of a compiler switch exactly like "/QIfist", 
which was added to VC7 (and VC6 with the "Processor Pack").  Using 
this switch means you are aware of (or should be) and happy with the 
above detailed assumption.

Of course, if something like this were to added to ocamlopt (for 
target architectures using IEEE floating point), code (an additional 
bytecode op?) emulating the same behavior could be added to the 
runtime to maintain consistency across the interpreted and native 
operating environments - or not.

Robert Roessler
roessler@rftp.com
http://www.rftp.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21  0:00 ` [Caml-list] " Robert Roessler
@ 2005-02-21  1:51   ` Erik de Castro Lopo
  2005-02-21  4:34     ` Robert Roessler
  0 siblings, 1 reply; 9+ messages in thread
From: Erik de Castro Lopo @ 2005-02-21  1:51 UTC (permalink / raw)
  To: caml-list

On Sun, 20 Feb 2005 16:00:53 -0800
Robert Roessler <roessler@rftp.com> wrote:

> I will preface this by a Slashdot-like "IANANA" (I Am Not A Numerical 
> Analyst).

Nor am I. I don't play one in a TV show either.

> The above approach is more or less what you expect if you (as a 
> compiler code generator) a) want to do rounding following C/C++ 
> standards ("Truncate (toward 0)"), and b) make no assumption regarding 
> the state of the IEEE hardware rounding setting...

Yes.

> You, on the other hand, are willing to make an assumption regarding 
> the hardware rounding mode - [presumably] that it is set to the 
> power-on default of "Round to nearest, or to even if equidistant", 
 > which may not be unreasonable - it just needs to be explicit that this 
> *is* the assumption, and that you have a way of verifying (or at least 
> reason to believe) that other software components in your app's 
> environment are not invalidating this assumption.

>From the lrint manpage on Linux:

    These functions round their argument  to  the  nearest  integer  value,
    using  the  current rounding direction.  If x is infinite or NaN, or if
    the rounded value is outside the range of the return type, the  numeric
    result  is unspecified.  A domain error may occur if the magnitude of x
    is too large.

I would suggest something similar for the proposed O'Caml function.

> The fact that the default hardware rounding mode does NOT match "(int) 
> floor (f + 0.5)" should also be mentioned... the "+ 0.5" attempts to 
> do what the hardware would call "Round up (toward +infinity)" 

Careful, that could very easily be confused with  the C functions ceil().

> This could take the form of a compiler switch exactly like "/QIfist", 

A compiler switch is significantly more complicated than my proposal
for a new built-in function with a well defined behaviour.

The compiler switch causes all int_to_float to behave like a round
instead of a truncate. My proposal allows a single file to have
both behaviours.

> Of course, if something like this were to added to ocamlopt (for 
> target architectures using IEEE floating point), code (an additional 
> bytecode op?) emulating the same behavior could be added to the 
> runtime to maintain consistency across the interpreted and native 
> operating environments - or not.

I believe that the bytecode interpreter simply uses the definitions
supplied by ocamlopt. Once ocamlopt has this functionality there
is not much else required to allow the interpretter to use the same
functionality.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"When you say "I wrote a program that crashed Windows", people
just stare at you blankly and say "Hey, I got those with the
system, *for free*." -- Linus Torvalds


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21  1:51   ` Erik de Castro Lopo
@ 2005-02-21  4:34     ` Robert Roessler
  2005-02-21  6:50       ` Erik de Castro Lopo
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Roessler @ 2005-02-21  4:34 UTC (permalink / raw)
  To: Erik de Castro Lopo; +Cc: caml-list

Erik de Castro Lopo wrote:

> On Sun, 20 Feb 2005 16:00:53 -0800
> Robert Roessler <roessler@rftp.com> wrote:
> ...
>>You, on the other hand, are willing to make an assumption regarding 
>>the hardware rounding mode - [presumably] that it is set to the 
>>power-on default of "Round to nearest, or to even if equidistant", 
>>which may not be unreasonable - it just needs to be explicit that this 
>>*is* the assumption, and that you have a way of verifying (or at least 
>>reason to believe) that other software components in your app's 
>>environment are not invalidating this assumption.
> 
> 
>>From the lrint manpage on Linux:
> 
>     These functions round their argument  to  the  nearest  integer  value,
>     using  the  current rounding direction.  If x is infinite or NaN, or if
>     the rounded value is outside the range of the return type, the  numeric
>     result  is unspecified.  A domain error may occur if the magnitude of x
>     is too large.
> 
> I would suggest something similar for the proposed O'Caml function.

Exactly like the "FIST[P]" instruction without explicit control word 
saving, setting and restoring - which is believable, since according 
to your first post, your reference implementation is done this way.

>>The fact that the default hardware rounding mode does NOT match "(int) 
>>floor (f + 0.5)" should also be mentioned... the "+ 0.5" attempts to 
>>do what the hardware would call "Round up (toward +infinity)" 
> 
> 
> Careful, that could very easily be confused with  the C functions ceil().

Umm, 2 of the IEEE hardware rounding modes *do* seem to match floor() 
and ceil()... it is probably not a coincidence. :)

>>This could take the form of a compiler switch exactly like "/QIfist", 
> 
> 
> A compiler switch is significantly more complicated than my proposal
> for a new built-in function with a well defined behaviour.
> 
> The compiler switch causes all int_to_float to behave like a round
> instead of a truncate. My proposal allows a single file to have
> both behaviours.

Actually, emulating the VC7 "/QIfist" does NOT cause "all int_to_float 
to behave like a round instead of a truncate" - it does exactly what 
we are already talking about: do the int_of_float with a "bare" 
FIST[P], operating in whatever rounding mode the hardware is already 
in (presumably one we want and expect it to be in).

>>Of course, if something like this were to added to ocamlopt (for 
>>target architectures using IEEE floating point), code (an additional 
>>bytecode op?) emulating the same behavior could be added to the 
>>runtime to maintain consistency across the interpreted and native 
>>operating environments - or not.
> 
> 
> I believe that the bytecode interpreter simply uses the definitions
> supplied by ocamlopt. Once ocamlopt has this functionality there
> is not much else required to allow the interpretter to use the same
> functionality.

The bytecode interpreter has nothing to do with ocamlopt... in fact, 
the primitive for doing int_of_float is precisely

return Val_long((long) Double_val(f));

which will perform a "truncate" operation for the conversion, since 
that is what the C standard requires. :)

Robert Roessler
roessler@rftp.com
http://www.rftp.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21  4:34     ` Robert Roessler
@ 2005-02-21  6:50       ` Erik de Castro Lopo
  0 siblings, 0 replies; 9+ messages in thread
From: Erik de Castro Lopo @ 2005-02-21  6:50 UTC (permalink / raw)
  To: caml-list

On Sun, 20 Feb 2005 20:34:14 -0800
Robert Roessler <roessler@rftp.com> wrote:

> Actually, emulating the VC7 "/QIfist" does NOT cause "all int_to_float 
> to behave like a round instead of a truncate" - it does exactly what 
> we are already talking about: do the int_of_float with a "bare" 
> FIST[P], operating in whatever rounding mode the hardware is already 
> in (presumably one we want and expect it to be in).

Implementing your proposal, means that int_of_float will change
behaviour depending on a compiler flag. I think thats a bad
idea.

My proposal leaves int_of_float just as it is and defines a new
function:

    val round_to_int : float -> int

which expand to a FISTPL instruction and one or two glue 
instructions.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
Good advice for everyone : stay away from churches, mosques and
synagouges.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-20 20:22 Need for a built in round_to_int function Erik de Castro Lopo
  2005-02-21  0:00 ` [Caml-list] " Robert Roessler
@ 2005-02-21 11:54 ` Erik de Castro Lopo
  2005-02-21 16:00   ` Xavier Leroy
  2005-02-21 16:01   ` Hendrik Tews
  1 sibling, 2 replies; 9+ messages in thread
From: Erik de Castro Lopo @ 2005-02-21 11:54 UTC (permalink / raw)
  To: caml-list

On Mon, 21 Feb 2005 07:22:55 +1100
Erik de Castro Lopo <ocaml-erikd@mega-nerd.com> wrote:

> I've looked at the code for the O'Caml compiler and I think I 
> know how to implement this, at least for x86 and PowerPC, the two
> architectures I have access to. If I was to supply a patch would
> it be accepted?

OK, I have a patch that seems to do the right thing on x86 and
PowerPC:

    http://www.mega-nerd.com/tmp/round_to_int.diff

I'm not 100% happy with the mods to byterun/floats.c. In order to
use the C99 lrintf() function in caml_round_to_int() I had to
define _ISOC9X_SOURCE etc before including <math.h>. The real
solution (when using GCC) is to add -std=c99 to the command line.

Using the following test program:

    http://www.mega-nerd.com/tmp/round_to_int.ml

with the hacked ocamlopt compiler I'm getting about a 3 times speed
improvement using round_to_int on a Pentium III. The speedup on P4
is supposed to be even more because of the P4's deeper pipeline.
On PowerPC, there is no speedup because PowerPC has two separate
instructions for float to int, one which rounds and one which 
truncates.

I'd be interested in any comments from the official ocaml maintainers.
What are the chances of this going into the official distribution?

I also have access to a machine with a MIPS CPU. I don't know
MIPS assembler but I'll see what I can do.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"I saw `cout' being shifted "Hello world" times to the left and
stopped right there." -- Steve Gonedes


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21 11:54 ` Erik de Castro Lopo
@ 2005-02-21 16:00   ` Xavier Leroy
  2005-02-22  7:23     ` Erik de Castro Lopo
  2005-02-21 16:01   ` Hendrik Tews
  1 sibling, 1 reply; 9+ messages in thread
From: Xavier Leroy @ 2005-02-21 16:00 UTC (permalink / raw)
  To: Erik de Castro Lopo; +Cc: caml-list

> [ "round float to int" primitive ]
> with the hacked ocamlopt compiler I'm getting about a 3 times speed
> improvement using round_to_int on a Pentium III. The speedup on P4
> is supposed to be even more because of the P4's deeper pipeline.

On the other hand, according to the P4 optimization manuals, the P4 is
supposed to special-case this particular use of fnstcw / fldcw, so
perhaps the situation is no worse than on the P3.  At any rate:

> I'd be interested in any comments from the official ocaml maintainers.
> What are the chances of this going into the official distribution?

Essentially zero :-(  Basically, this is a case where additional stuff is
introduced in the machine-independent parts of ocamlopt and in every
code generator just to work around the brain-dead x87 floating-point
instruction set.  Every other processor (as well as the SSE2 instr.set
on the P4) has a fast "truncate float to int" instruction, from which
an efficient "round to nearest int" is easy to obtain (truncate (x +. 0.5)).  

I spent a lot of time in the past trying to extract decent float
performance out of the x87 instruction set, under the strict
constraint that these performance hacks should be confined to the
x86-specific parts of ocamlopt.  Nowadays, I no longer care about
performance for x87: users who want good float performance should
simply use the x86-64 architecture (with SSE2 floats), it's cheaply
available from both AMD and Intel and works so much better for
floats...  At the extreme, I'd rather do a x86+SSE2 port (for P4 and
P3M-class processors) than spend more time on x86+x87.

- Xavier Leroy


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21 11:54 ` Erik de Castro Lopo
  2005-02-21 16:00   ` Xavier Leroy
@ 2005-02-21 16:01   ` Hendrik Tews
  1 sibling, 0 replies; 9+ messages in thread
From: Hendrik Tews @ 2005-02-21 16:01 UTC (permalink / raw)
  To: caml-list

Erik de Castro Lopo <ocaml-erikd@mega-nerd.com> writes:

   I'd be interested in any comments from the official ocaml maintainers.

You should file your patch as a feature wish in the bug data
base. This has also been recommended several times by the
"officials".

Bye,

Hendrik Tews


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] Need for a built in round_to_int function
  2005-02-21 16:00   ` Xavier Leroy
@ 2005-02-22  7:23     ` Erik de Castro Lopo
  0 siblings, 0 replies; 9+ messages in thread
From: Erik de Castro Lopo @ 2005-02-22  7:23 UTC (permalink / raw)
  To: caml-list

On Mon, 21 Feb 2005 17:00:23 +0100
Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

> On the other hand, according to the P4 optimization manuals, the P4 is
> supposed to special-case this particular use of fnstcw / fldcw, so
> perhaps the situation is no worse than on the P3.  

OK, I've just tested this. On P4 the performance hit of fnstcw / fldcw 
is not as bad as it is with P3, but its still significant:

Using this test program (compiled with my hacked version of ocamlopt):

     http://www.mega-nerd.com/tmp/round_to_int.ml

On a 450MHz P3:

    Time int_of_float : 5.970000
    Time round_to_int : 2.360000

On a 2.8GHz P4:

    Time int_of_float : 0.420000
    Time round_to_int : 0.260000

> Essentially zero :-(  Basically, this is a case where additional stuff is
> introduced in the machine-independent parts of ocamlopt and in every
> code generator just to work around the brain-dead x87 floating-point
> instruction set.

Obviously it is your decision, but I think round_to_int is a common 
enough operation to warrant its own function. The ISO C Standards
committee thought so.

>  Every other processor (as well as the SSE2 instr.set

Quite honestly I think the value of SSE and SSE2 are over sold.
There are certain algorthims which simply can't be made to run
as fast on SSE/SSE2 as they run on the x87 FPU.

For instance, my audio sample rate converter:

    http://www.mega-nerd.com/SRC/

If I compile this on a P3 with gcc-3.4 using -mfpmath=sse -msse,
the highest quality (and hence slowest) converter runs 50% slower 
than the x87 FPU version. I have also tried re-writing the algorithm 
in hand optimised SSE code. The best I could get (I'm not an assembler 
expert) was still 10% slower than the x87 FPU.

I have just now repeated my experiment by compiling SRC on a P4 
with -msse2 (-mfpmath=sse2 doesn't work), the converter runs 75% 
slower than the x87 FPU version.

> I spent a lot of time in the past trying to extract decent float
> performance out of the x87 instruction set,

<snip>

> Nowadays, I no longer care about
> performance for x87: users who want good float performance should
> simply use the x86-64 architecture (with SSE2 floats), 

I'd love to get my hands on one of these, but I really do doubt
that its performance will be much better than that of the P4. 
The main problem is that generating good SSE/SSE2 code from
a high level language is an order of magnitude more difficult
than generating code for the x87 FPU.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Projects promoting programming in natural language are intrinsically
doomed to fail." -- Edsger Dijkstra


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-02-22  7:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-02-20 20:22 Need for a built in round_to_int function Erik de Castro Lopo
2005-02-21  0:00 ` [Caml-list] " Robert Roessler
2005-02-21  1:51   ` Erik de Castro Lopo
2005-02-21  4:34     ` Robert Roessler
2005-02-21  6:50       ` Erik de Castro Lopo
2005-02-21 11:54 ` Erik de Castro Lopo
2005-02-21 16:00   ` Xavier Leroy
2005-02-22  7:23     ` Erik de Castro Lopo
2005-02-21 16:01   ` Hendrik Tews

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).