From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA06968; Tue, 21 Jan 2003 10:56:51 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id KAA06748 for <caml-list@pauillac.inria.fr>; Tue, 21 Jan 2003 10:56:50 +0100 (MET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id h0L9umr01093;
	Tue, 21 Jan 2003 10:56:48 +0100 (MET)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA06984; Tue, 21 Jan 2003 10:56:48 +0100 (MET)
Date: Tue, 21 Jan 2003 10:56:48 +0100
From: Xavier Leroy <xavier.leroy@inria.fr>
To: Oleg <oleg_inconnu@myrealbox.com>
Cc: caml-list@inria.fr
Subject: [Caml-list] Re: Coyote Gulch test in Caml
Message-ID: <20030121105648.A5543@pauillac.inria.fr>
References: <3E15B3B3.3040106@163.com> <20030103071042.T22850@speakeasy.org> <20030104193118.A26208@pauillac.inria.fr> <200301181749.48295.oleg_inconnu@myrealbox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <200301181749.48295.oleg_inconnu@myrealbox.com>; from oleg_inconnu@myrealbox.com on Sat, Jan 18, 2003 at 05:49:47PM -0500
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

> On Saturday 04 January 2003 01:31 pm, Xavier Leroy wrote:
> > Apparently, the ocamlopt-generated code
> > offers less instruction-level parallelism than the g++-generated code
> > for the float computations.  Still, I haven't really understood where
> > the factor of 2 comes from.  

Oleg asks:

> It's been a couple of weeks. I'm wondering if you got any new insights into 
> this?

Yes: I'm just back from a trip to the US and had plenty of time to
kill during the transatlantic flights :-)

Apparently, one cause of inefficiency is excessive storing of
float results in memory temporaries.  The x86 is a wierd beast: while
loading floats from memory is quite fast (almost as fast as using a
float already on the register stack), storing (the fstp instruction)
seems to be quite expensive.  

Fortunately, a small modification to the ocamlopt x86 code generator
can remove many of these stores to temporaries in the case of
the Almabench test.   With this modification, the OCaml code runs at
2/3 the speed of the code generated by g++ -O3, which is still not
great but more in-line with previous numerical benchmarks.

I also played with a "-ffast-math" flag for ocamlopt, whereas some math
functions (sin, cos, sqrt, log) are directly expanded into x86
instructions.  With this, we get 85% of the performance of g++ -O3,
which isn't bad, and 2/3 of the performance of g++ -O3 -ffast-math.

At any rate, the changes above to the OCaml code generator need to be
tested more before possible inclusion in the next release.  Never
trust code that you wrote in an airplane, especially while fighting
for the armrest with an elderly central European lady who doesn't
understand any of the languages that you speak :-)

> Just as wild guess: the code contains calls to "sin" and "cos" on the same 
> value. Perhaps GCC manages to optimize those into one call to "sincos"

No, gcc doesn't do that.  But perhaps the Intel compiler does.

David Chase warns:

> Just a silly question, but if you want sin and cos to go faster,
> how much accuracy are you willing to trade away for improved
> performance?  Just for example, by using the Pentium instructions,
> you reduce the number of (accurate) significant bits in the result
> from 53 (IEEE double) to 13 (for some inputs between zero and 2*PI).
> (If you are using 64-bit mantissas, the worst case is only 4 bits of
> accuracy.)

I didn't know that.  At any rate, the sin() and cos() functions from
the Linux libm probably suffer from this loss of precision too,
because they are of the following form:

cos:    fcos instruction
        if operand was in the [-2^64,2^64] range, return
        reduce operand modulo 2pi
        fcos instruction
        return

Hence, using fcos rather than calling cos() should give the same (not
very exact) result as long as the operand is in the [-2^64,2^64] range,
and return a nonsensical result otherwise.

Nickolay Semyonov-Kolchin asks:

> But then this brings up the issue of conformity vr.s performance.  For
> example- the x86 has its 80-bit FP registers in 8087-legacy mode, but
> 64-bit registers if you're using SSE2.  And PowerPC and PA-RISC both have
> extended precision fused multiply-adds (that keep higher precision, i.e.
> don't round, between the multiply and the adds).

ocamlopt uses 80-bit floats for intermediate results on the x86, and
the multiply-add instruction on the PowerPC.  It is true that this can
cause the final results to differ from those of the bytecode compiler,
which uses strict 64-bit float arithmetic, but I believe this is
acceptable, both for the additional speed and because the result is
"more exact" from a numerical analysis standpoint.

> For that matter, could a 
> "conforming" implementation of Ocaml use the 32-bit single precision SSE-1 
> registers?

Using single-precision FP is questionable because of the important
loss in precision.  However, SSE-2 supports double precision
arithmetic on SSE registers, and that could be an adequate target for
ocamlopt-generated code.  I plan to experiment with this soon.

- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners