From: Erik de Castro Lopo
To: caml-list@yquem.inria.fr
Date: Tue, 22 Feb 2005 18:23:07 +1100
Subject: Re: [Caml-list] Need for a built in round_to_int function
Message-Id: <20050222182307.39ae9854.ocaml-erikd@mega-nerd.com>
In-Reply-To: <20050221160023.GA4759@yquem.inria.fr>
References: <20050221072255.29055ee4.ocaml-erikd@mega-nerd.com> <20050221225432.4f15c5e5.ocaml-erikd@mega-nerd.com> <20050221160023.GA4759@yquem.inria.fr>

On Mon, 21 Feb 2005 17:00:23 +0100
Xavier Leroy wrote:

> On the other hand, according to the P4 optimization manuals, the P4 is
> supposed to special-case this particular use of fnstcw / fldcw, so
> perhaps the situation is no worse than on the P3.

OK, I've just tested this. On the P4 the performance hit of fnstcw / fldcw
is not as bad as it is on the P3, but it's still significant.

Using this test program (compiled with my hacked version of ocamlopt):

    http://www.mega-nerd.com/tmp/round_to_int.ml

On a 450MHz P3:

    Time int_of_float : 5.970000
    Time round_to_int : 2.360000

On a 2.8GHz P4:

    Time int_of_float : 0.420000
    Time round_to_int : 0.260000

> Essentially zero :-( Basically, this is a case where additional stuff is
> introduced in the machine-independent parts of ocamlopt and in every
> code generator just to work around the brain-dead x87 floating-point
> instruction set.

Obviously it is your decision, but I think round_to_int is a common
enough operation to warrant its own function. The ISO C Standards
committee thought so.

> Every other processor (as well as the SSE2 instr.set

Quite honestly, I think the value of SSE and SSE2 is oversold. There
are certain algorithms which simply can't be made to run as fast on
SSE/SSE2 as they run on the x87 FPU. For instance, my audio sample
rate converter:

    http://www.mega-nerd.com/SRC/

If I compile this on a P3 with gcc-3.4 using -mfpmath=sse -msse, the
highest quality (and hence slowest) converter runs 50% slower than the
x87 FPU version. I have also tried re-writing the algorithm in
hand-optimised SSE code.
The best I could get (I'm not an assembler expert) was still 10%
slower than the x87 FPU version.

I have just now repeated my experiment by compiling SRC on a P4 with
-msse2 (-mfpmath=sse2 doesn't work); the converter runs 75% slower
than the x87 FPU version.

> I spent a lot of time in the past trying to extract decent float
> performance out of the x87 instruction set,

> Nowadays, I no longer care about
> performance for x87: users who want good float performance should
> simply use the x86-64 architecture (with SSE2 floats),

I'd love to get my hands on one of these, but I really do doubt that
its performance will be much better than that of the P4. The main
problem is that generating good SSE/SSE2 code from a high-level
language is an order of magnitude more difficult than generating code
for the x87 FPU.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Projects promoting programming in natural language are
intrinsically doomed to fail." -- Edsger Dijkstra
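
For readers who want to see the semantic difference between the two
conversions timed above, here is a minimal sketch in plain OCaml. It is
an illustration only: round_to_int below is a hypothetical stand-in
written with floor, not the built-in from the patched ocamlopt, and it
rounds halfway cases upward, whereas C99's lrint follows the current
rounding mode.

```ocaml
(* Hypothetical stand-in for the round_to_int primitive under discussion.
   Assumption: round to nearest by adding 0.5 and taking the floor, so
   halfway cases round upward. This is NOT the patched ocamlopt built-in. *)
let round_to_int (x : float) : int =
  int_of_float (floor (x +. 0.5))

(* Print both conversions side by side for a few sample values. *)
let () =
  List.iter
    (fun x ->
      Printf.printf "%5.2f  int_of_float = %2d  round_to_int = %2d\n"
        x (int_of_float x) (round_to_int x))
    [ 2.4; 2.6; -2.4; -2.6 ]
```

int_of_float truncates toward zero, so 2.6 becomes 2 and -2.6 becomes
-2, while the rounding version gives 3 and -3. The sketch says nothing
about speed; the timings above come from a native primitive, which this
floor-based version does not reproduce.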