From: Michel Schinz
To: caml-list@inria.fr
Subject: Re: Performance of threaded interpreter on hyper-threaded CPU
Date: Tue, 18 Apr 2006 12:27:22 +0200
References: <4444A46C.5000102@inria.fr>

Xavier Leroy writes:

> > When the ratio given in the last column is greater than 1, then
> > threaded code is faster than the switch-based solution. As you can
> > see, this is only true in my case for non-hyper-threaded
> > architectures.
>
> Which version(s) of gcc do you use for compiling the bytecode
> interpreter? Is it the same version on all machines?

No, unfortunately not. Here are the various versions used (I realise
this variety is annoying, but I have no control over what software
runs on these machines):

1.25 GHz PPC G4
  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5247)

1.70 GHz P4
  gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

3.0 GHz hyper-threaded P4
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

dual 3.0 GHz hyper-threaded Xeon
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

I'm aware of the problem due to gcc's cross-jumping "optimisation"
(described, as you mention, by Ertl in [1]). For the record, I tried
disabling it with -fno-crossjumping, but as Ertl mentions, this didn't
change anything. However, judging by the versions of gcc I'm using,
cross-jumping should also be performed on the second machine, for
which threaded code provides a noticeable gain...

Still, your remark motivated me to measure the performance of a
single ocamlrun executable running on the various Pentium 4 machines
I have at hand, and the results are interesting...

Using the executable produced by gcc 3.2.2, I obtain the following
timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

while using the executable produced by gcc 3.4.4, I obtain the
following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

Finally, I noticed that gcc 4.0.0 was also available on the second
machine, so I gave it a try, and obtained the following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

So the threaded code version of the OCaml VM is always slower on the
hyper-threaded P4, albeit not always by the same amount.

Michel.

[1] http://www.complang.tuwien.ac.at/forth/threading/
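
For readers unfamiliar with the two dispatch techniques compared in the
tables, here is a minimal sketch of each. It is a toy stack machine,
not the OCaml bytecode interpreter: the opcodes and the function names
run_switch and run_threaded are made up for illustration. The threaded
variant relies on GCC's labels-as-values extension, so it is GNU C
rather than standard C.

```c
#include <stddef.h>

/* Hypothetical toy bytecode (not OCaml's instruction set). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Switch-based dispatch: every instruction returns to the top of the
   loop, so all instructions share a single dispatch point. */
long run_switch(const int *code)
{
    long stack[16];
    size_t sp = 0;

    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1; /* unknown opcode */
        }
    }
}

/* Threaded dispatch (GNU C labels-as-values): each instruction ends
   with its own indirect jump to the next instruction's code. */
long run_threaded(const int *code)
{
    /* Table indexed by opcode; order must match the enum above. */
    static void *const labels[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };
    long stack[16];
    size_t sp = 0;

#define NEXT goto *labels[*code++]
    NEXT;
do_push: stack[sp++] = *code++; NEXT;
do_add:  sp--; stack[sp - 1] += stack[sp]; NEXT;
do_mul:  sp--; stack[sp - 1] *= stack[sp]; NEXT;
do_halt: return stack[sp - 1];
#undef NEXT
}
```

Both functions should return the same value for the same program, for
example 20 for the sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4,
OP_MUL, OP_HALT. The point of the threaded form is that each NEXT is a
separate indirect jump; as I understand Ertl's description in [1],
cross-jumping can merge these identical tails back into one shared
jump, which would make the compiled code behave much like the switch
version.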