From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ocaml-erikd@mega-nerd.com>
Delivered-To: caml-list@yquem.inria.fr
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id C2227BC75
	for <caml-list@yquem.inria.fr>; Tue, 22 Feb 2005 08:23:13 +0100 (CET)
Received: from smtp.syd.swiftdsl.com.au (smtp.syd.swiftdsl.com.au [218.214.224.138])
	by nez-perce.inria.fr (8.13.0/8.13.0) with SMTP id j1M7NAN2007992
	for <caml-list@yquem.inria.fr>; Tue, 22 Feb 2005 08:23:12 +0100
Received: (qmail 17417 invoked from network); 22 Feb 2005 07:23:21 -0000
Received: from unknown (HELO coltrane.mega-nerd.net) (218.214.64.136)
  by smtp.syd.swiftdsl.com.au with SMTP; 22 Feb 2005 07:23:21 -0000
Received: from coltrane (localhost [127.0.0.1])
	by coltrane.mega-nerd.net (Postfix) with SMTP id 5F9437AD7
	for <caml-list@yquem.inria.fr>; Tue, 22 Feb 2005 18:23:07 +1100 (EST)
Date: Tue, 22 Feb 2005 18:23:07 +1100
From: Erik de Castro Lopo <ocaml-erikd@mega-nerd.com>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Need for a built in round_to_int function
Message-Id: <20050222182307.39ae9854.ocaml-erikd@mega-nerd.com>
In-Reply-To: <20050221160023.GA4759@yquem.inria.fr>
References: <20050221072255.29055ee4.ocaml-erikd@mega-nerd.com>
	<20050221225432.4f15c5e5.ocaml-erikd@mega-nerd.com>
	<20050221160023.GA4759@yquem.inria.fr>
Organization: Erik Conspiracy Secret Labs
X-Mailer: Sylpheed version 1.0.0 (GTK+ 1.2.10; i386-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Miltered: at nez-perce with ID 421ADDDE.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; caml-list:01 wrote:01 hacked:01 ocamlopt:01 ocamlopt:01 instr:01 slowest:01 re-writing:01 nospam:98 compile:01 slower:01 slower:01 floats:01 essentially:01 algorithm:01 
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on yquem.inria.fr
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.2
X-Spam-Level: 

On Mon, 21 Feb 2005 17:00:23 +0100
Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

> On the other hand, according to the P4 optimization manuals, the P4 is
> supposed to special-case this particular use of fnstcw / fldcw, so
> perhaps the situation is no worse than on the P3.  

OK, I've just tested this. On P4 the performance hit of fnstcw / fldcw 
is not as bad as it is with P3, but its still significant:

Using this test program (compiled with my hacked version of ocamlopt):

     http://www.mega-nerd.com/tmp/round_to_int.ml

On a 450MHz P3:

    Time int_of_float : 5.970000
    Time round_to_int : 2.360000

On a 2.8GHz P4:

    Time int_of_float : 0.420000
    Time round_to_int : 0.260000

> Essentially zero :-(  Basically, this is a case where additional stuff is
> introduced in the machine-independent parts of ocamlopt and in every
> code generator just to work around the brain-dead x87 floating-point
> instruction set.

Obviously it is your decision, but I think round_to_int is a common 
enough operation to warrant its own function. The ISO C Standards
committee thought so.

>  Every other processor (as well as the SSE2 instr.set

Quite honestly I think the value of SSE and SSE2 are over sold.
There are certain algorthims which simply can't be made to run
as fast on SSE/SSE2 as they run on the x87 FPU.

For instance, my audio sample rate converter:

    http://www.mega-nerd.com/SRC/

If I compile this on a P3 with gcc-3.4 using -mfpmath=sse -msse,
the highest quality (and hence slowest) converter runs 50% slower 
than the x87 FPU version. I have also tried re-writing the algorithm 
in hand optimised SSE code. The best I could get (I'm not an assembler 
expert) was still 10% slower than the x87 FPU.

I have just now repeated my experiment by compiling SRC on a P4 
with -msse2 (-mfpmath=sse2 doesn't work), the converter runs 75% 
slower than the x87 FPU version.

> I spent a lot of time in the past trying to extract decent float
> performance out of the x87 instruction set,

<snip>

> Nowadays, I no longer care about
> performance for x87: users who want good float performance should
> simply use the x86-64 architecture (with SSE2 floats), 

I'd love to get my hands on one of these, but I really do doubt
that its performance will be much better than that of the P4. 
The main problem is that generating good SSE/SSE2 code from
a high level language is an order of magnitude more difficult
than generating code for the x87 FPU.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Projects promoting programming in natural language are intrinsically
doomed to fail." -- Edsger Dijkstra