From: "Till Varoquaux"
To: "Michel Schinz"
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
Date: Tue, 18 Apr 2006 13:40:56 +0200
Message-ID: <9d3ec8300604180440s74c5c908pe2c9f6f8d344bab7@mail.gmail.com>
References: <4444A46C.5000102@inria.fr>

I might just add that hyperthreading is pretty far from a real
multiprocessor setup... It works best when the various threads are using
different units of the CPU, which is less likely to happen when the
threads are all doing basically the same thing. A friend of mine has been
experimenting on Xeons recently with the exact same code (i.e.,
multithreaded in both cases); he gains 12.5% when using hyperthreading.
This might be an extreme example... However, supposing you were in the
same case, it is quite conceivable that the few percent you scrape by are
lost in the machinery required to get multithreading working properly
(mutexes etc.).

Could you try running your multithreaded code on only one of the virtual
CPUs to see the improvement hyperthreading really brings in?

Till

On 4/18/06, Michel Schinz <Michel.Schinz@epfl.ch> wrote:
>
> Xavier Leroy <Xavier.Leroy@inria.fr> writes:
>
> > > When the ratio given in the last column is greater than 1, then
> > > threaded code is faster than the switch-based solution. As you can
> > > see, this is only true in my case for non-hyper-threaded
> > > architectures.
> >
> > Which version(s) of gcc do you use for compiling the bytecode
> > interpreter? Is it the same version on all machines?
>
> No, unfortunately not. Here are the various versions used (I realise
> this variety is annoying, but I have no control over what software
> runs on these machines):
>
> 1.25 GHz PPC G4
>   powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1
>   (Apple Computer, Inc. build 5247)
> 1.70 GHz P4
>   gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
> 3.0 GHz hyper-threaded P4
>   gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
> dual 3.0 GHz hyper-threaded Xeon
>   gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>
> I'm aware of the problem due to gcc's cross-jumping "optimisation"
> (described as you mention by Ertl in [1]). For the record, I tried
> disabling it with -fno-crossjumping, but as Ertl mentions, this didn't
> change anything. However, judging by the versions of gcc I'm using,
> cross-jumping should also be performed on the second machine, for
> which threaded code provides a noticeable gain...
>
> However, your remark motivated me to measure the performance of a
> single ocamlrun executable running on the various Pentium 4s I have at
> hand, and the results are interesting...
>
> Using the executable produced by gcc 3.2.2, I obtain the following
> timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |
>
> while using the executable produced by gcc 3.4.4, I obtain the
> following timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |
>
> Finally, I noticed that gcc 4.0.0 was also available on the second
> machine, so I gave it a try, and obtained the following timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |
>
> So the threaded code version of the OCaml VM is always slower on the
> hyper-threaded P4, albeit not always by the same amount.
>
> Michel.
>
> [1] http://www.complang.tuwien.ac.at/forth/threading/
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
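
As a small sanity check on the quoted tables: the ratio column appears to be
the switch time divided by the threaded time, shown to five significant
digits, so a value above 1 means threaded code is faster. A minimal sketch
reproducing the gcc 3.2.2 figures follows; the helper name `speedup_ratio`
and the rounding rule are assumptions for illustration, not something stated
in the thread.

```python
# Timings quoted in the thread for the executable built with gcc 3.2.2.
# Each entry is (switch_time, threaded_time); the thread does not state
# the unit, so none is assumed here.
TIMINGS_GCC_3_2_2 = {
    "1.70 GHz Pentium 4": (6.34, 4.82),
    "3.0 GHz Pentium 4, hyper-threaded": (2.62, 3.46),
    "dual 3.0 GHz Xeon, hyper-threaded": (3.36, 2.59),
}


def speedup_ratio(switch_time, threaded_time):
    """Hypothetical helper: switch / threaded, to 5 significant digits.

    A result greater than 1 means the threaded-code interpreter was faster.
    """
    return float(f"{switch_time / threaded_time:.5g}")


for arch, (switch_time, threaded_time) in TIMINGS_GCC_3_2_2.items():
    ratio = speedup_ratio(switch_time, threaded_time)
    verdict = "threaded faster" if ratio > 1 else "switch faster"
    print(f"{arch}: {ratio} ({verdict})")
```

The same helper can be applied to the gcc 3.4.4 and gcc 4.0.0 timings to
compare compilers on one machine.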