From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 36826BB84
	for <caml-list@yquem.inria.fr>; Tue, 18 Apr 2006 13:41:04 +0200 (CEST)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id k3IBf34W021075
	for <caml-list@yquem.inria.fr>; Tue, 18 Apr 2006 13:41:03 +0200
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id NAA13814 for <caml-list@pauillac.inria.fr>; Tue, 18 Apr 2006 13:41:03 +0200 (MET DST)
Received: from xproxy.gmail.com (xproxy.gmail.com [66.249.82.205])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id k3IBf1YP005640
	for <caml-list@inria.fr>; Tue, 18 Apr 2006 13:41:02 +0200
Received: by xproxy.gmail.com with SMTP id h26so500592wxd
        for <caml-list@inria.fr>; Tue, 18 Apr 2006 04:41:01 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
        s=beta; d=gmail.com;
        h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
        b=uhQQApjsWHEPyJsp5eGuNwBEKK7Br5m8I/bbwefgoPVPkSjtYpxc63KAoztGqNutsmYCs4GehD53T2Okk/vwxO0simwYkNkXmOiIBms2ikGa39DdDpdf/V6Zu/xFIK0nw9UxxIYlqVspHgNAlpP9gMB2Iu+9ImMuP4UmTiQkQ7g=
Received: by 10.70.117.1 with SMTP id p1mr1927163wxc;
        Tue, 18 Apr 2006 04:40:56 -0700 (PDT)
Received: by 10.70.128.18 with HTTP; Tue, 18 Apr 2006 04:40:56 -0700 (PDT)
Message-ID: <9d3ec8300604180440s74c5c908pe2c9f6f8d344bab7@mail.gmail.com>
Date: Tue, 18 Apr 2006 13:40:56 +0200
From: "Till Varoquaux" <till.varoquaux@gmail.com>
To: "Michel Schinz" <Michel.Schinz@epfl.ch>
Subject: Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
Cc: caml-list@inria.fr
In-Reply-To: <m2slobtbzp.fsf@torre.local>
MIME-Version: 1.0
Content-Type: multipart/alternative; 
	boundary="----=_Part_24102_23889084.1145360456848"
References: <m2psjfpavt.fsf@torre.local> <4444A46C.5000102@inria.fr>
	 <m2slobtbzp.fsf@torre.local>
X-Miltered: at concorde with ID 4444D04F.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at nez-perce with ID 4444D04D.002 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; threads:01 threads:01 whith:01 mutexes:01 schinz:01 schinz:01 epfl:01 gcc:01 bytecode:01 gcc:01 gcc's:01 noticable:01 ocamlrun:01 timings:01 timings:01 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.1 required=5.0 tests=HTML_40_50,HTML_MESSAGE,
	RCVD_BY_IP autolearn=disabled version=3.0.3

------=_Part_24102_23889084.1145360456848
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

I might just add that hyperthreading is pretty far from a real
multiprocessor setup...
It works best when the various threads are using differents units of the
cpu, wich is less liable to happen when the threads are unning doing
basically the same thing. A friend of mine has been experimenting on Xeons
recently whith the exact same code (i.e.: multithreaded in both cases) he
gains 12.5% when using hyperthreading. This might be an extreme example...
However supposing you were in the same case it is very conceivable that the
few percent you scrape by  are lost in the machinery required to get
multithreading working properly (mutexes etc...).
Could you try running your multithreaded code on only one of the virtual cp=
u
to see the improvement hyperthreading really brings in?
Till

On 4/18/06, Michel Schinz <Michel.Schinz@epfl.ch> wrote:
>
> Xavier Leroy <Xavier.Leroy@inria.fr> writes:
>
> >  > When the ratio given in the last column is greater than 1, then
> >  > threaded code is faster than the switch-based solution. As you can
> >  > see, this is only true in my case for non-hyper-threaded
> >  > architectures.
> >
> > Which version(s) of gcc do you use for compiling the bytecode
> > interpreter?  Is it the same version on all machines?
>
> No, unfortunately not. Here are the various versions used (I realise
> this variety is annoying, but I have no control over what software
> runs on these machines):
>
> 1.25 GHz PPC G4
>   powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1
>    (Apple Computer, Inc. build 5247)
> 1.70 GHz P4
>   gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
> 3.0 GHz hyper-threaded P4
>   gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
> dual 3.0 GHz hyper-threaded Xeon
>   gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>
> I'm aware of the problem due to gcc's cross-jumping "optimisation"
> (described as you mention by Ertl in [1]). For the record, I tried
> disabling it with -fno-crossjumping, but as Ertl mention, this didn't
> change anything. However, judging by the versions of gcc I'm using,
> cross-jumping should also be performed on the second machine, for
> which threaded code provides a noticable gain...
>
> However, your remark motivated me to measure the performance of a
> single ocamlrun executable running on the various Pentium 4 I have at
> hand, and the results are interesting...
>
> Using the executable produced by gcc 3.2.2, I obtain the following
> timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |
>
> while using the executable produced by gcc 3.4.4, I obtain the
> following timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |
>
> Finally, I noticed that gcc 4.0.0 was also available on the second
> machine, so I gave it a try, and obtained the following timings:
>
> | architecture                      | switch | threaded |   ratio |
> |-----------------------------------+--------+----------+---------|
> | 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
> | 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |
>
> So the threaded code version of the OCaml VM is always slower on the
> hyper-threaded P4, albeit not always by the same amount.
>
> Michel.
>
> [1] http://www.complang.tuwien.ac.at/forth/threading/
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

------=_Part_24102_23889084.1145360456848
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

I might just add that hyperthreading is pretty far from a real multiprocess=
or setup...<br>It works best when the various threads are using differents =
units of the cpu, wich is less liable to happen when the threads are unning=
 doing basically the same thing. A friend of mine has been experimenting on=
 Xeons recently whith the exact same code (
i.e.: multithreaded in both cases) he gains 12.5%
when using hyperthreading. This might be an extreme example...<br>However s=
upposing you were in the same case it is very conceivable that the few perc=
ent you scrape by&nbsp; are lost in the machinery required to get multithre=
ading working properly (mutexes etc...).
<br>Could you try running your multithreaded code on only one of the virtua=
l cpu to see the improvement hyperthreading really brings in?<br>Till<br><b=
r><div><span class=3D"gmail_quote">On 4/18/06, <b class=3D"gmail_sendername=
">
Michel Schinz</b> &lt;<a href=3D"mailto:Michel.Schinz@epfl.ch" target=3D"_b=
lank" onclick=3D"return top.js.OpenExtLink(window,event,this)">Michel.Schin=
z@epfl.ch</a>&gt; wrote:</span><blockquote class=3D"gmail_quote" style=3D"b=
order-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; paddin=
g-left: 1ex;">

Xavier Leroy &lt;<a href=3D"mailto:Xavier.Leroy@inria.fr" target=3D"_blank"=
 onclick=3D"return top.js.OpenExtLink(window,event,this)">Xavier.Leroy@inri=
a.fr</a>&gt; writes:<br><br>&gt;&nbsp;&nbsp;&gt; When the ratio given in th=
e last column is greater than 1, then
<br>&gt;&nbsp;&nbsp;&gt; threaded code is faster than the switch-based solu=
tion. As you can
<br>&gt;&nbsp;&nbsp;&gt; see, this is only true in my case for non-hyper-th=
readed<br>&gt;&nbsp;&nbsp;&gt; architectures.<br>&gt;<br>&gt; Which version=
(s) of gcc do you use for compiling the bytecode<br>&gt; interpreter?&nbsp;=
&nbsp;Is it the same version on all machines?
<br><br>No, unfortunately not. Here are the various versions used (I realis=
e<br>this variety is annoying, but I have no control over what software<br>=
runs on these machines):<br><br>1.25 GHz PPC G4<br>&nbsp;&nbsp;powerpc-appl=
e-darwin8-gcc-4.0.1

 (GCC) 4.0.1<br>&nbsp;&nbsp; (Apple Computer, Inc. build 5247)<br>1.70 GHz =
P4<br>&nbsp;&nbsp;gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)<br>3.0 G=
Hz hyper-threaded P4<br>&nbsp;&nbsp;gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4=
-2)<br>dual 3.0 GHz hyper-threaded Xeon
<br>&nbsp;&nbsp;gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)<br><br>I'm aware=
 of the problem due to gcc's cross-jumping &quot;optimisation&quot;<br>(des=
cribed as you mention by Ertl in [1]). For the record, I tried<br>disabling=
 it with -fno-crossjumping, but as Ertl mention, this didn't
<br>change anything. However, judging by the versions of gcc I'm using,<br>=
cross-jumping should also be performed on the second machine, for<br>which =
threaded code provides a noticable gain...<br><br>However, your remark moti=
vated me to measure the performance of a
<br>single ocamlrun executable running on the various Pentium 4 I have at<b=
r>hand, and the results are interesting...<br><br>Using the executable prod=
uced by gcc 3.2.2, I obtain the following<br>timings:<br><br>| architecture=
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb=
sp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| switch | threade=
d |&nbsp;&nbsp; ratio |
<br>|-----------------------------------+--------+----------+---------|<br>=
| 1.70 GHz Pentium 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&=
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp; 6.34 |&nbsp;&nbsp;&n=
bsp;&nbsp; 4.82 |&nbsp;&nbsp;1.3154 |<br>| 3.0 GHz Pentium 4, hyper-threade=
d |&nbsp;&nbsp; 2.62 |&nbsp;&nbsp;&nbsp;&nbsp; 3.46 | 0.75723 |<br>| dual=
=20
3.0 GHz Xeon, hyper-threaded |&nbsp;&nbsp; 3.36 |&nbsp;&nbsp;&nbsp;&nbsp; 2=
.59 |&nbsp;&nbsp;1.2973 |<br><br>while using the executable produced by gcc=
 3.4.4, I obtain the<br>following timings:<br><br>| architecture&nbsp;&nbsp=
;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n=
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| switch | threaded |&nbsp;&n=
bsp; ratio |
<br>|-----------------------------------+--------+----------+---------|<br>=
| 1.70 GHz Pentium 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&=
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp; 6.26 |&nbsp;&nbsp;&n=
bsp;&nbsp; 6.70 | 0.93433 |<br>| 3.0 GHz Pentium 4, hyper-threaded |&nbsp;&=
nbsp; 2.51 |&nbsp;&nbsp;&nbsp;&nbsp; 6.15 | 0.40813 |<br>| dual=20
3.0 GHz Xeon, hyper-threaded |&nbsp;&nbsp; 3.32 |&nbsp;&nbsp;&nbsp;&nbsp; 3=
.58 | 0.92737 |<br><br>Finally, I noticed that gcc 4.0.0 was also available=
 on the second<br>machine, so I gave it a try, and obtained the following t=
imings:<br><br>| architecture&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs=
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&=
nbsp;&nbsp;| switch | threaded |&nbsp;&nbsp; ratio |
<br>|-----------------------------------+--------+----------+---------|<br>=
| 1.70 GHz Pentium 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&=
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp; 7.27 |&nbsp;&nbsp;&n=
bsp;&nbsp; 6.62 |&nbsp;&nbsp;1.0982 |<br>| 3.0 GHz Pentium 4, hyper-threade=
d |&nbsp;&nbsp; 2.37 |&nbsp;&nbsp;&nbsp;&nbsp; 4.75 | 0.49895 |<br>| dual=
=20
3.0 GHz Xeon, hyper-threaded |&nbsp;&nbsp; 3.91 |&nbsp;&nbsp;&nbsp;&nbsp; 3=
.56 |&nbsp;&nbsp;1.0983 |<br><br>So the threaded code version of the OCaml =
VM is always slower on the<br>hyper-threaded P4, albeit not always by the s=
ame amount.<br><br>Michel.<br><br>[1]=20
<a href=3D"http://www.complang.tuwien.ac.at/forth/threading/" target=3D"_bl=
ank" onclick=3D"return top.js.OpenExtLink(window,event,this)">http://www.co=
mplang.tuwien.ac.at/forth/threading/</a><br><br>___________________________=
____________________
<br>Caml-list mailing list. Subscription management:<br>
<a href=3D"http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list" target=
=3D"_blank" onclick=3D"return top.js.OpenExtLink(window,event,this)">http:/=
/yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list</a><br>Archives: <a href=
=3D"http://caml.inria.fr" target=3D"_blank" onclick=3D"return top.js.OpenEx=
tLink(window,event,this)">
http://caml.inria.fr</a><br>Beginner's list: <a href=3D"http://groups.yahoo=
.com/group/ocaml_beginners" target=3D"_blank" onclick=3D"return top.js.Open=
ExtLink(window,event,this)">
http://groups.yahoo.com/group/ocaml_beginners</a><br>Bug reports: <a href=
=3D"http://caml.inria.fr/bin/caml-bugs" target=3D"_blank" onclick=3D"return=
 top.js.OpenExtLink(window,event,this)">http://caml.inria.fr/bin/caml-bugs<=
/a>
<br></blockquote></div><br>


------=_Part_24102_23889084.1145360456848--