From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 6FC6DBBAF for ; Fri, 8 May 2009 12:21:39 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AisCAGukA0rUGyoEk2dsb2JhbACXEgEBAQEJCQoJEQO4JIN9BQ X-IronPort-AV: E=Sophos;i="4.40,316,1238968800"; d="scan'208";a="39643151" Received: from smtp4-g21.free.fr ([212.27.42.4]) by mail4-smtp-sop.national.inria.fr with ESMTP; 08 May 2009 12:21:37 +0200 Received: from smtp4-g21.free.fr (localhost [127.0.0.1]) by smtp4-g21.free.fr (Postfix) with ESMTP id A492F4C80DB; Fri, 8 May 2009 12:21:32 +0200 (CEST) Received: from [192.168.1.3] (che78-2-82-237-71-191.fbx.proxad.net [82.237.71.191]) by smtp4-g21.free.fr (Postfix) with ESMTP id 92A1D4C8125; Fri, 8 May 2009 12:21:29 +0200 (CEST) Message-ID: <4A0407A9.4000009@inria.fr> Date: Fri, 08 May 2009 12:21:29 +0200 From: Xavier Leroy User-Agent: Thunderbird 2.0.0.4 (X11/20070620) MIME-Version: 1.0 To: Dmitry Bely Cc: Caml List Subject: Re: [Caml-list] Ocamlopt x86-32 and SSE2 References: <90823c940904281236x61204451nac149ee15b5df73a@mail.gmail.com> <4A0005C8.8010609@inria.fr> <90823c940905050241y11f012e5xee8316e3e4337ff9@mail.gmail.com> In-Reply-To: <90823c940905050241y11f012e5xee8316e3e4337ff9@mail.gmail.com> X-Enigmail-Version: 0.95.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Spam: no; 0.00; ocamlopt:01 ocamlopt:01 sqrt:01 ocaml:01 compiler:01 compilation:01 model:01 command-line:01 gcc:01 variants:01 packagers:01 hypothetical:01 implements:01 wikipedia:01 wiki:01 Dmitry Bely wrote: > I see. Why I asked this: trying to improve floating-point performance > on 32-bit x86 platform I have merged floating-point SSE2 code > generator from amd64 ocamlopt back end to i386 one, making ia32sse2 > architecture. It also inlines sqrt() via -ffast-math flag and slightly > optimizes emit_float_test (usually eliminates an extra jump) - > features that are missed in the original amd64 code generator. You just passed black belt in OCaml compiler hacking :-) > Is this of any interest to anybody? I'm definitely interested in the potential improvements to the amd64 code generator. Concerning the i386 code generator (x86 in 32-bit mode), SSE2 float arithmetic does improve performance and fit ocamlopt's compilation model much better than the current x87 float arithmetic, which is a bit of a hack. Several options can be considered: 1- Have an additional "ia32sse2" port of ocamlopt in parallel with the current "i386" port. 2- Declare pre-SSE2 processors obsolete and convert the current "i386" port to always use SSE2 float arithmetic. 3- Support both x87 and SSE2 float arithmetic within the same i386 port, with a command-line option to activate SSE2, like gcc does. I'm really not keen on approach 1. We have too many ports (and their variants for Windows/MSVC) already. Moreover, I suspect packagers would stick to the i386 port for compatibility with old hardware, and most casual users would, too, out of lazyness, so this hypothetical "ia32sse2" port would receive little testing. Approach 2 is tempting for me because it would simplify the x86-32 code generator and remove some historical cruft. The issue is that it demands a processor that implements SSE2. For a list of processors, see http://en.wikipedia.org/wiki/SSE2 As a rule of thumb, almost all desktop PC bought since 2004 has SSE2, as well as almost all notebooks since 2006. That should be OK for professional users (it's nearly impossible to purchase maintenance beyond 3 years, anyway) and serious hobbyists. However, packagers are going to be very unhappy: Debian still lists i486 as its bottom line; for Fedora, it's Pentium or Pentium II; for Windows, it's "a 1GHz processor", meaning Pentium III. All these processors lack SSE2 support. Only MacOS X is SSE2-compatible from scratch. Approach 3 is probably the best from a user's point of view. But it's going to complicate the code generator: the x87 cruft would still be there, and new cruft would need to be added to support SSE2. Code compiled with the SSE2 flag could link with code compiled without, provided the SSE2 registers are not used for parameter and result passing. But as Dmitry observed, this is already the case in the current ocamlopt compiler. Jean-Marc Eber: >> But again, having better floating point performance (and >> predictable behaviour, compared to the bytecode version) would be a >> big plus for some applications. Dmitry Bely: > Don't quite understand what is "predictable behavior" - any generator > should conform to specs. In my tests x87 and SSE2 backends show the > same results (otherwise it would be called a bug). You haven't tested enough :-). The x87 backend keeps some intermediate results in 80-bit float format, while the SSE2 backend (as well as all other backends and the bytecode interpreter) compute everything in 64-bit format. See David Monniaux's excellent tutorial: http://hal.archives-ouvertes.fr/hal-00128124/en/ Computing intermediate results in extended precision has pros and cons, but my understanding is that the cons slightly outweigh the pros. - Xavier Leroy