From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id 373F0BB84 for ; Tue, 18 Apr 2006 10:10:09 +0200 (CEST) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id k3I8A8iZ012728 for ; Tue, 18 Apr 2006 10:10:08 +0200 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id KAA10434 for ; Tue, 18 Apr 2006 10:10:08 +0200 (MET DST) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id k3I8A7rf023199 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Tue, 18 Apr 2006 10:10:07 +0200 Received: from root by ciao.gmane.org with local (Exim 4.43) id 1FVlHe-00089R-01 for caml-list@inria.fr; Tue, 18 Apr 2006 10:10:02 +0200 Received: from vpn-epfl-a063.epfl.ch ([128.178.83.83]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 18 Apr 2006 10:10:01 +0200 Received: from Michel.Schinz by vpn-epfl-a063.epfl.ch with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 18 Apr 2006 10:10:01 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: caml-list@inria.fr From: Michel Schinz Subject: Performance of threaded interpreter on hyper-threaded CPU Date: Tue, 18 Apr 2006 10:04:54 +0200 Message-ID: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: vpn-epfl-a063.epfl.ch User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (darwin) Cancel-Lock: sha1:rlk8Lws3kwXxEqIAJTfkZpkFzLI= Sender: news X-Miltered: at nez-perce with ID 44449EE0.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at concorde with ID 44449EDF.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; schinz:01 schinz:01 epfl:01 speedup:01 ocaml:01 speedup:01 ocaml:01 ocaml-:01 defines:01 byterun:01 config:01 interprete:01 byterun:01 interp:01 ifdef:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 Hi, In order to get an idea of the speedup provided by a threaded code interpreter, I've been comparing the performance of the switch-based OCaml interpreter with the one which uses threaded code. On some architectures, threaded code provides a speedup which matches my expectations (around 20%), but on a hyper-threaded Pentium IV, I actually get a massive slowdown: the threaded code interpreter is more than two times slower than the switch-based one! I just wanted to present my results here, since it seems that threaded code might not always be the fastest option. I'd also be interested in knowing why threaded code is so much slower in some cases, provided of course that my results are not flawed. My testing methodology and result are described below. To get the two versions of the OCaml interpreter, I uncompress ocaml-3.09.1 in two separate directories, and in one of them I simply delete the line which defines THREADED_CODE in byterun/config.h. I also add the following lines at the very beginning of ocaml_interprete in byterun/interp.c (in both directories): #ifdef THREADED_CODE fprintf(stderr, "threaded code\n"); #else fprintf(stderr, "switch-based\n"); #endif They enable me to be sure that the version which is running is the one I expect. Once this is done, I compile the two versions by first launching configure with a different prefix for both directories, and then letting "make world install" do its job. When this is complete, I have two versions of the interpreter, one using threaded code, the other using a big switch, and my measurements can start. To perform my measures, I use a small program which computes the factorial of 5000 using a naive implementation of big integers (represented as lists of "digits" in base 10000). I compile this program once then run it with the switch-based interpreter, and then with the threaded code one. I run the benchmark five times in a row, and select the lowest time, as given by the "time" command. The results are summarised in the following table. When the ratio given in the last column is greater than 1, then threaded code is faster than the switch-based solution. As you can see, this is only true in my case for non-hyper-threaded architectures. Concerning the OS, the first machine runs OS X 10.4.6, while the other ones run various versions of Linux. | architecture | switch | threaded | ratio | |-----------------------------------+--------+----------+---------| | 1.25 GHzPower PC G4 | 9.04 | 7.24 | 1.2486 | | 1.70 GHz Pentium 4 | 6.36 | 4.81 | 1.3222 | | 3.0 GHz Pentium 4, hyper-threaded | 2.51 | 6.13 | 0.40946 | | dual 3.0 GHz Xeon, hyper-threaded | 3.32 | 3.59 | 0.92479 | I also measured the time taken by "make world" on the third machine, and the results confirm that the threaded code interpreter is slower than the switch-based one. Here are the timings: switch-based : 89.53s user, 12.73s system threaded code: 114.77s user, 13.03 system ratio (sw/th): 0.78 I will gladly provide more information about the various systems used for testing if anyone is interested. The small benchmark program I'm using is available there: http://lamp.epfl.ch/~schinz/bignums.ml Michel.