[Caml-list] Re: Coyote Gulch test in Caml

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Xavier Leroy <xavier.leroy@inria.fr>
To: Oleg <oleg_inconnu@myrealbox.com>
Cc: caml-list@inria.fr
Subject: [Caml-list] Re: Coyote Gulch test in Caml
Date: Tue, 21 Jan 2003 10:56:48 +0100	[thread overview]
Message-ID: <20030121105648.A5543@pauillac.inria.fr> (raw)
In-Reply-To: <200301181749.48295.oleg_inconnu@myrealbox.com>; from oleg_inconnu@myrealbox.com on Sat, Jan 18, 2003 at 05:49:47PM -0500

> On Saturday 04 January 2003 01:31 pm, Xavier Leroy wrote:
> > Apparently, the ocamlopt-generated code
> > offers less instruction-level parallelism than the g++-generated code
> > for the float computations.  Still, I haven't really understood where
> > the factor of 2 comes from.  

Oleg asks:

> It's been a couple of weeks. I'm wondering if you got any new insights into 
> this?

Yes: I'm just back from a trip to the US and had plenty of time to
kill during the transatlantic flights :-)

Apparently, one cause of inefficiency is excessive storing of
float results in memory temporaries.  The x86 is a wierd beast: while
loading floats from memory is quite fast (almost as fast as using a
float already on the register stack), storing (the fstp instruction)
seems to be quite expensive.  

Fortunately, a small modification to the ocamlopt x86 code generator
can remove many of these stores to temporaries in the case of
the Almabench test.   With this modification, the OCaml code runs at
2/3 the speed of the code generated by g++ -O3, which is still not
great but more in-line with previous numerical benchmarks.

I also played with a "-ffast-math" flag for ocamlopt, whereas some math
functions (sin, cos, sqrt, log) are directly expanded into x86
instructions.  With this, we get 85% of the performance of g++ -O3,
which isn't bad, and 2/3 of the performance of g++ -O3 -ffast-math.

At any rate, the changes above to the OCaml code generator need to be
tested more before possible inclusion in the next release.  Never
trust code that you wrote in an airplane, especially while fighting
for the armrest with an elderly central European lady who doesn't
understand any of the languages that you speak :-)

> Just as wild guess: the code contains calls to "sin" and "cos" on the same 
> value. Perhaps GCC manages to optimize those into one call to "sincos"

No, gcc doesn't do that.  But perhaps the Intel compiler does.

David Chase warns:

> Just a silly question, but if you want sin and cos to go faster,
> how much accuracy are you willing to trade away for improved
> performance?  Just for example, by using the Pentium instructions,
> you reduce the number of (accurate) significant bits in the result
> from 53 (IEEE double) to 13 (for some inputs between zero and 2*PI).
> (If you are using 64-bit mantissas, the worst case is only 4 bits of
> accuracy.)

I didn't know that.  At any rate, the sin() and cos() functions from
the Linux libm probably suffer from this loss of precision too,
because they are of the following form:

cos:    fcos instruction
        if operand was in the [-2^64,2^64] range, return
        reduce operand modulo 2pi
        fcos instruction
        return

Hence, using fcos rather than calling cos() should give the same (not
very exact) result as long as the operand is in the [-2^64,2^64] range,
and return a nonsensical result otherwise.

Nickolay Semyonov-Kolchin asks:

> But then this brings up the issue of conformity vr.s performance.  For
> example- the x86 has its 80-bit FP registers in 8087-legacy mode, but
> 64-bit registers if you're using SSE2.  And PowerPC and PA-RISC both have
> extended precision fused multiply-adds (that keep higher precision, i.e.
> don't round, between the multiply and the adds).

ocamlopt uses 80-bit floats for intermediate results on the x86, and
the multiply-add instruction on the PowerPC.  It is true that this can
cause the final results to differ from those of the bytecode compiler,
which uses strict 64-bit float arithmetic, but I believe this is
acceptable, both for the additional speed and because the result is
"more exact" from a numerical analysis standpoint.

> For that matter, could a 
> "conforming" implementation of Ocaml use the 32-bit single precision SSE-1 
> registers?

Using single-precision FP is questionable because of the important
loss in precision.  However, SSE-2 supports double precision
arithmetic on SSE registers, and that could be an adequate target for
ocamlopt-generated code.  I plan to experiment with this soon.

- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

next prev parent reply	other threads:[~2003-01-21  9:56 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-01-03 16:00 [Caml-list] speed onlyclimb
2003-01-03 11:38 ` [Caml-list] speed Clemens Hintze
2003-01-03 11:47 ` [Caml-list] speed Noel Welsh
2003-01-02 16:45   ` Chet Murthy
2003-01-03 13:32 ` Xavier Leroy
2003-01-02 17:52   ` Chet Murthy
2003-01-03 14:53     ` Sven Luther
2003-01-03 15:28       ` Erol Akarsu
2003-01-02 17:53   ` Coyote Gulch test in Caml (was Re: [Caml-list] speed ) Chet Murthy
2003-01-03 15:10     ` Shawn Wagner
2003-01-03 15:56       ` Oleg
2003-01-04 18:31       ` Xavier Leroy
2003-01-18 22:49         ` Oleg
2003-01-18 23:50           ` Shawn Wagner
2003-01-20 21:23             ` David Chase
2003-01-20 21:39               ` Nickolay Semyonov-Kolchin
2003-01-21  0:54                 ` Brian Hurt
2003-01-21 13:09                 ` David Chase
2003-01-21 13:15                   ` Daniel Andor
2003-01-21 20:26                   ` Nickolay Semyonov-Kolchin
2003-01-19 10:33           ` Siegfried Gonzi
2003-01-19 10:34           ` Siegfried Gonzi
2003-01-21  9:56           ` Xavier Leroy [this message]
2003-01-21 15:57             ` [Caml-list] Re: Coyote Gulch test in Caml Brian Hurt
2003-01-27 16:58             ` Daniel Andor
2003-01-28  8:27               ` Christian Lindig
2003-01-05  1:13   ` [Caml-list] speed Brian Hurt
2003-01-05  1:48     ` Michael Vanier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030121105648.A5543@pauillac.inria.fr \
    --to=xavier.leroy@inria.fr \
    --cc=caml-list@inria.fr \
    --cc=oleg_inconnu@myrealbox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).