Date: Mon, 29 Mar 2010 22:58:45 +0400
In-Reply-To: <4BB0DA17.6090002@inria.fr>
References: <4B967857.3070303@inria.fr> <90823c941003230158n5080b46apc7c7462aead3372d@mail.gmail.com> <4BB0DA17.6090002@inria.fr>
Message-ID: <90823c941003291158h40dcbdd8s8a473d593072677f@mail.gmail.com>
Subject: Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Dmitry Bely
To: caml-list@inria.fr

On Mon, Mar 29, 2010 at 8:49 PM, Xavier Leroy wrote:
>>> This is a call for testers concerning an experimental OCaml compiler
>>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>>
>> I cannot provide any benchmark yet
>
> Too bad :-( I got very little feedback to my call: just one data point
> (thanks Gaetan).  Perhaps most OCaml users interested in numerical
> computations have switched to x86-64bits already?  At any rate, given
> such a lack of interest, this x86-32/SSE2 port isn't going to make it
> into the OCaml distribution.

It's a pity. Probably even my (future) benchmarks won't help...

>> but even not taking into account
>> the better register organization there are at least two areas where
>> SSE2 can outperform x87 significantly.
>>
>> 1. Float to integer conversion
>> Is quite inefficient on x87 because you have to explicitly set and
>> restore rounding mode.
>
> Right.  The mode change makes the conversion about 10x slower on x87
> than on SSE2.  Apparently, float->int conversion is uncommon in
> numerical code, otherwise we'd observe bigger speedups on real
> applications...
>
>> 2. Float compare
>> Does not set flags on x87 so
>
> The SSE2 code is prettier than the x87 code, but this doesn't seem to
> translate into a significant performance gain, in my limited testing.
>
>> As for SSE2 backend presented I have some thoughts regarding the code
>> (fast math functions via x87 are questionable,
>
> Most x86-32bits C libraries implement sin(), cos(), etc with the x87
> instructions, so I'm curious to know what you find objectionable here.

Microsoft's implementation for P4 and above is SSE2-based. And Intel
itself recommends doing so:

[quote]
What Is AM Library?
===================

Ever missed a sine or arctangent instruction among Intel Streaming SIMD
Extensions? Ever wished there were a way to calculate logarithm or
exponent in about a dozen cycles? Here is a new release of Approximate
Math Library (AM Library) -- a set of fast routines to calculate math
functions using Intel(R) Streaming SIMD Extensions (SSE) and Streaming
SIMD Extensions 2 (SSE2). The Library offers trigonometric, reverse
trigonometric, logarithmic, and exponential functions for packed and
scalar arguments. The processing speed is many times faster than that of
x87 instructions and even of table lookups. The accuracy of AM Library
routines can be adequate for many applications. It is comparable with
that of reciprocal SSE instructions, and is hundreds times better than
what is achievable with lookup tables.

The AM Library is provided along with the full source code and a usage
sample.
[end of quote]

http://www.intel.com/design/pentiumiii/devtools/AMaths.zip

Another interesting read:

http://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

>> optimization of floating compare etc.) Where to discuss that - just
>> here or there is some entry in Mantis?
>
> Why not start on this list?  We'll move to private e-mail if the
> discussion becomes too heated :-)

OK

1. My variant of emit_float_test (in many cases eliminates extra jump).

let emit_float_test cmp neg arg lbl =
  (* For each (comparison, negation) pair: the branch opcode, and whether
     an extra jp is needed to handle unordered (NaN) operands.
     comisd/ucomisd set ZF=PF=CF=1 on unordered, so ja/jae are never
     taken and jb/jbe are always taken in that case. *)
  let opcode_jp cmp =
    match (cmp, neg) with
      (Ceq, false) -> ("je", true)
    | (Ceq, true)  -> ("jne", true)
    | (Cne, false) -> ("jne", true)
    | (Cne, true)  -> ("je", true)
    | (Clt, false) -> ("jb", true)
    | (Clt, true)  -> ("jae", true)
    | (Cle, false) -> ("jbe", true)
    | (Cle, true)  -> ("ja", true)
    | (Cgt, false) -> ("ja", false)
    | (Cgt, true)  -> ("jbe", false)
    | (Cge, false) -> ("jae", false)
    | (Cge, true)  -> ("jb", false) in
  let branch_opcode, need_jp = opcode_jp cmp in
  let branch_opcode, arg0, arg1, need_jp =
    match arg.(1).loc with
      Reg _ when need_jp ->
        (* swap args if it excludes jmp *)
        let (branch_opcode_swap, need_jp_swap) =
          opcode_jp (match cmp with
                       Ceq -> Ceq | Cne -> Cne
                     | Clt -> Cgt | Cle -> Cge
                     | Cgt -> Clt | Cge -> Cle) in
        if need_jp_swap then
          (branch_opcode, arg.(0), arg.(1), true)
        else
          (branch_opcode_swap, arg.(1), arg.(0), false)
    | _ ->
        (branch_opcode, arg.(0), arg.(1), need_jp) in
  begin match cmp with
  | Ceq | Cne -> `      ucomisd `
  | _ -> `      comisd  `
  end;
  `{emit_reg arg0}, {emit_reg arg1}\n`;
  let branch_if_not_comparable =
    if cmp = Cne then not neg else neg in
  if need_jp then
    if branch_if_not_comparable then begin
      ` jp      {emit_label lbl}\n`;
      ` {emit_string branch_opcode}     {emit_label lbl}\n`
    end else begin
      let next = new_label() in
      ` jp      {emit_label next}\n`;
      ` {emit_string branch_opcode}     {emit_label lbl}\n`;
      `{emit_label next}:\n`
    end
  else begin
    `   {emit_string branch_opcode}     {emit_label lbl}\n`
  end

2. My variant of fast math functions (see explanation above)

let emit_floatspecial = function
    "sqrt" -> ` sqrtsd  `
  | _ -> assert false

3. Loading st(0) can be two instructions shorter :)

    `   sub     esp, 8\n`;
    `   fstp    REAL8 PTR [esp]\n`;
    `   movsd   {emit_reg dst}, REAL8 PTR [esp]\n`;
    `   add     esp, 8\n`

can be written as

    `   fstp    REAL8 PTR [esp-8]\n`;
    `   movlpd  {emit_reg dst}, REAL8 PTR [esp-8]\n`;

4. Unnecessary instruction in Lop(Iload(Single, addr))

    `   movss   {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`;
    `   cvtss2sd        {emit_reg dest}, {emit_reg dest}\n`

can be written as

    `   cvtss2sd        {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`

- Dmitry Bely
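
A minimal sketch, not taken from the patch, of the NaN behaviour that the
jp handling in emit_float_test has to reproduce. The comparison type here
is a local stand-in for the compiler's own, and branch_taken is a
hypothetical helper; branch_if_not_comparable restates the rule used in
item 1. The check only confirms that, with a NaN operand, the rule agrees
with OCaml's source-level float comparisons.

```ocaml
(* Local stand-in for the compiler's comparison type (an assumption made
   so that this file compiles on its own). *)
type comparison = Ceq | Cne | Clt | Cle | Cgt | Cge

(* Hypothetical helper: source-level meaning of "branch to lbl".
   The branch is taken when the test holds, or when it does not hold
   if [neg] is set. *)
let branch_taken cmp neg (x : float) (y : float) =
  let holds =
    match cmp with
    | Ceq -> x = y
    | Cne -> x <> y
    | Clt -> x < y
    | Cle -> x <= y
    | Cgt -> x > y
    | Cge -> x >= y
  in
  if neg then not holds else holds

(* Same rule as in emit_float_test: should the branch be taken when the
   operands are unordered (at least one NaN)? *)
let branch_if_not_comparable cmp neg =
  if cmp = Cne then not neg else neg

let () =
  List.iter
    (fun cmp ->
      List.iter
        (fun neg ->
          (* With a NaN on either side, both definitions must agree. *)
          assert (branch_taken cmp neg nan 1.0
                  = branch_if_not_comparable cmp neg);
          assert (branch_taken cmp neg 1.0 nan
                  = branch_if_not_comparable cmp neg))
        [ false; true ])
    [ Ceq; Cne; Clt; Cle; Cgt; Cge ];
  print_endline "ok"
```

Only x <> y holds for unordered operands, which is why Cne is the one
comparison whose unordered outcome is inverted relative to neg.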