From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id D0D55BB84
	for <caml-list@yquem.inria.fr>; Tue, 18 Apr 2006 12:27:41 +0200 (CEST)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id k3IARf2L010892
	for <caml-list@yquem.inria.fr>; Tue, 18 Apr 2006 12:27:41 +0200
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id MAA12522 for <caml-list@pauillac.inria.fr>; Tue, 18 Apr 2006 12:27:40 +0200 (MET DST)
Received: from ciao.gmane.org (main.gmane.org [80.91.229.2])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id k3IARdZ6010885
	(version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO)
	for <caml-list@inria.fr>; Tue, 18 Apr 2006 12:27:40 +0200
Received: from list by ciao.gmane.org with local (Exim 4.43)
	id 1FVnQg-0004Pj-4P
	for caml-list@inria.fr; Tue, 18 Apr 2006 12:27:30 +0200
Received: from vpn-epfl-a063.epfl.ch ([128.178.83.83])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <caml-list@inria.fr>; Tue, 18 Apr 2006 12:27:30 +0200
Received: from Michel.Schinz by vpn-epfl-a063.epfl.ch with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <caml-list@inria.fr>; Tue, 18 Apr 2006 12:27:30 +0200
X-Injected-Via-Gmane: http://gmane.org/
To: caml-list@inria.fr
From: Michel Schinz <Michel.Schinz@epfl.ch>
Subject:  Re: Performance of threaded interpreter on hyper-threaded CPU
Date:  Tue, 18 Apr 2006 12:27:22 +0200
Message-ID:  <m2slobtbzp.fsf@torre.local>
References:  <m2psjfpavt.fsf@torre.local> <4444A46C.5000102@inria.fr>
Mime-Version:  1.0
Content-Type:  text/plain; charset=us-ascii
X-Complaints-To: usenet@sea.gmane.org
X-Gmane-NNTP-Posting-Host: vpn-epfl-a063.epfl.ch
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (darwin)
Cancel-Lock: sha1:cASlFeU6Jv37bCWpLO2Fc60dQ0U=
Sender: news <news@sea.gmane.org>
X-Miltered: at concorde with ID 4444BF1D.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at concorde with ID 4444BF1B.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; schinz:01 schinz:01 epfl:01 gcc:01 bytecode:01 gcc:01 gcc's:01 noticable:01 ocamlrun:01 timings:01 timings:01 ocaml:01 complang:01 tuwien:01 threading:01 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.3

Xavier Leroy <Xavier.Leroy@inria.fr> writes:

>  > When the ratio given in the last column is greater than 1, then
>  > threaded code is faster than the switch-based solution. As you can
>  > see, this is only true in my case for non-hyper-threaded
>  > architectures.
>
> Which version(s) of gcc do you use for compiling the bytecode
> interpreter?  Is it the same version on all machines?

No, unfortunately not. Here are the various versions used (I realise
this variety is annoying, but I have no control over what software
runs on these machines):

1.25 GHz PPC G4
  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1
   (Apple Computer, Inc. build 5247)
1.70 GHz P4
  gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
3.0 GHz hyper-threaded P4
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
dual 3.0 GHz hyper-threaded Xeon
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

I'm aware of the problem due to gcc's cross-jumping "optimisation"
(described as you mention by Ertl in [1]). For the record, I tried
disabling it with -fno-crossjumping, but as Ertl mention, this didn't
change anything. However, judging by the versions of gcc I'm using,
cross-jumping should also be performed on the second machine, for
which threaded code provides a noticable gain...

However, your remark motivated me to measure the performance of a
single ocamlrun executable running on the various Pentium 4 I have at
hand, and the results are interesting...

Using the executable produced by gcc 3.2.2, I obtain the following
timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

while using the executable produced by gcc 3.4.4, I obtain the
following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

Finally, I noticed that gcc 4.0.0 was also available on the second
machine, so I gave it a try, and obtained the following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

So the threaded code version of the OCaml VM is always slower on the
hyper-threaded P4, albeit not always by the same amount.

Michel.

[1] http://www.complang.tuwien.ac.at/forth/threading/