Date: Tue, 21 Jan 2003 10:56:48 +0100
From: Xavier Leroy
To: Oleg
Cc: caml-list@inria.fr
Subject: [Caml-list] Re: Coyote Gulch test in Caml
Message-ID: <20030121105648.A5543@pauillac.inria.fr>

> On Saturday 04 January 2003 01:31 pm, Xavier Leroy wrote:
> > Apparently, the ocamlopt-generated code
> > offers less instruction-level parallelism than the g++-generated code
> > for the float computations. Still, I haven't really understood where
> > the factor of 2 comes from.

Oleg asks:

> It's been a couple of weeks. I'm wondering if you got any new insights into
> this?

Yes: I'm just back from a trip to the US and had plenty of time to
kill during the transatlantic flights :-)

Apparently, one cause of inefficiency is excessive storing of float
results in memory temporaries.
The x86 is a weird beast: while loading floats from memory is quite
fast (almost as fast as using a float already on the register stack),
storing (the fstp instruction) seems to be quite expensive.

Fortunately, a small modification to the ocamlopt x86 code generator
can remove many of these stores to temporaries in the case of the
Almabench test. With this modification, the OCaml code runs at 2/3
the speed of the code generated by g++ -O3, which is still not great
but more in line with previous numerical benchmarks.

I also played with a "-ffast-math" flag for ocamlopt, whereby some
math functions (sin, cos, sqrt, log) are directly expanded into x86
instructions. With this, we get 85% of the performance of g++ -O3,
which isn't bad, and 2/3 of the performance of g++ -O3 -ffast-math.

At any rate, the changes above to the OCaml code generator need to be
tested more before possible inclusion in the next release. Never
trust code that you wrote in an airplane, especially while fighting
for the armrest with an elderly central European lady who doesn't
understand any of the languages that you speak :-)

> Just as wild guess: the code contains calls to "sin" and "cos" on the same
> value. Perhaps GCC manages to optimize those into one call to "sincos"

No, gcc doesn't do that. But perhaps the Intel compiler does.

David Chase warns:

> Just a silly question, but if you want sin and cos to go faster,
> how much accuracy are you willing to trade away for improved
> performance? Just for example, by using the Pentium instructions,
> you reduce the number of (accurate) significant bits in the result
> from 53 (IEEE double) to 13 (for some inputs between zero and 2*PI).
> (If you are using 64-bit mantissas, the worst case is only 4 bits of
> accuracy.)

I didn't know that.
At any rate, the sin() and cos() functions from the Linux libm
probably suffer from this loss of precision too, because they are of
the following form:

        cos:
            fcos instruction
            if operand was in the [-2^64,2^64] range, return
            reduce operand modulo 2pi
            fcos instruction
            return

Hence, using fcos rather than calling cos() should give the same (not
very exact) result as long as the operand is in the [-2^64,2^64]
range, and return a nonsensical result otherwise.

Nickolay Semyonov-Kolchin asks:

> But then this brings up the issue of conformity vr.s performance. For
> example- the x86 has its 80-bit FP registers in 8087-legacy mode, but
> 64-bit registers if you're using SSE2. And PowerPC and PA-RISC both have
> extended precision fused multiply-adds (that keep higher precision, i.e.
> don't round, between the multiply and the adds).

ocamlopt uses 80-bit floats for intermediate results on the x86, and
the multiply-add instruction on the PowerPC. It is true that this can
cause the final results to differ from those of the bytecode compiler,
which uses strict 64-bit float arithmetic, but I believe this is
acceptable, both for the additional speed and because the result is
"more exact" from a numerical analysis standpoint.

> For that matter, could a
> "conforming" implementation of Ocaml use the 32-bit single precision SSE-1
> registers?

Using single-precision FP is questionable because of the significant
loss in precision. However, SSE-2 supports double precision
arithmetic on SSE registers, and that could be an adequate target for
ocamlopt-generated code. I plan to experiment with this soon.

- Xavier Leroy
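
The point made above about fused multiply-add (results that differ from
strict 64-bit arithmetic yet are "more exact") can be illustrated with a
small sketch. It is written in Python rather than OCaml, purely for
illustration, and it does not run any ocamlopt-generated code. The
function names and the input values are assumptions chosen for the
example. The fused operation is modelled with exact rational arithmetic
followed by a single rounding to a double.

```python
from fractions import Fraction


def separate_roundings(a, b, c):
    """Strict 64-bit evaluation: a*b is rounded to a double, then the sum is rounded again."""
    return a * b + c


def single_rounding(a, b, c):
    """Model of a fused multiply-add: exact a*b + c, rounded to a double only once."""
    # Fraction(x) converts a double exactly, and float() of a Fraction
    # rounds the exact rational value to the nearest double.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Hypothetical inputs: the exact product is 1 - 2**-60, which does not
# fit in the 53-bit significand of a double and rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(separate_roundings(a, b, c))  # 0.0: the low-order bits of the product are lost
print(single_rounding(a, b, c))     # equals -(2.0 ** -60), the exact answer
```

With these inputs the two evaluations disagree, and the singly rounded
result is the mathematically exact one. The exact product needs only 60
significant bits, so a 64-bit significand such as the x87 extended
format would also hold it without rounding.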