caml-list - the Caml user's mailing list
* Performance of threaded interpreter on hyper-threaded CPU
@ 2006-04-18  8:04 Michel Schinz
  2006-04-18  8:33 ` [Caml-list] " Xavier Leroy
  2006-04-18  8:34 ` [Caml-list] " Christophe TROESTLER
  0 siblings, 2 replies; 14+ messages in thread
From: Michel Schinz @ 2006-04-18  8:04 UTC (permalink / raw)
  To: caml-list

Hi,

In order to get an idea of the speedup provided by a threaded code
interpreter, I've been comparing the performance of the switch-based
OCaml interpreter with the one which uses threaded code.

On some architectures, threaded code provides a speedup which matches
my expectations (around 20%), but on a hyper-threaded Pentium IV, I
actually get a massive slowdown: the threaded code interpreter is more
than two times slower than the switch-based one! I just wanted to
present my results here, since it seems that threaded code might not
always be the fastest option. I'd also be interested in knowing why
threaded code is so much slower in some cases, provided of course that
my results are not flawed.

My testing methodology and results are described below.

To get the two versions of the OCaml interpreter, I uncompress
ocaml-3.09.1 in two separate directories, and in one of them I simply
delete the line which defines THREADED_CODE in byterun/config.h. I
also add the following lines at the very beginning of caml_interprete
in byterun/interp.c (in both directories):

#ifdef THREADED_CODE
  fprintf(stderr, "threaded code\n");
#else
  fprintf(stderr, "switch-based\n");
#endif

These lines let me confirm that the version which is running is the one
I expect.

Once this is done, I compile the two versions by first launching
configure with a different prefix for both directories, and then
letting "make world install" do its job. When this is complete, I have
two versions of the interpreter, one using threaded code, the other
using a big switch, and my measurements can start.

To perform my measurements, I use a small program which computes the
factorial of 5000 using a naive implementation of big integers
(represented as lists of "digits" in base 10000). I compile this
program once, then run it with the switch-based interpreter, and then
with the threaded code one. I run the benchmark five times in a row,
and select the lowest time, as given by the "time" command. The
results are summarised in the following table. When the ratio given in
the last column is greater than 1, then threaded code is faster than
the switch-based solution. As you can see, this is only true in my
case for non-hyper-threaded architectures. Concerning the OS, the
first machine runs OS X 10.4.6, while the other ones run various
versions of Linux.

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.25 GHz PowerPC G4               |   9.04 |     7.24 |  1.2486 |
| 1.70 GHz Pentium 4                |   6.36 |     4.81 |  1.3222 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.13 | 0.40946 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.59 | 0.92479 |

I also measured the time taken by "make world" on the third machine,
and the results confirm that the threaded code interpreter is slower
than the switch-based one. Here are the timings:

  switch-based : 89.53s user, 12.73s system
  threaded code: 114.77s user, 13.03s system
  ratio (sw/th): 0.78

I will gladly provide more information about the various systems used
for testing if anyone is interested.

The small benchmark program I'm using is available here:

http://lamp.epfl.ch/~schinz/bignums.ml
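
For readers without access to that file, here is a rough C sketch of
the kind of workload described above: a factorial computed on naive big
integers stored as base-10000 digits. It is not the original
bignums.ml. The names (bignum, factorial, bignum_to_string) are
hypothetical, and a growable array stands in for the OCaml list.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BASE 10000 /* each "digit" holds four decimal digits */

/* Big natural number: least significant base-10000 digit first. */
typedef struct {
    int *digits;
    size_t len, cap;
} bignum;

/* Append one digit at the most significant end, growing as needed. */
static void bignum_push(bignum *b, int d) {
    if (b->len == b->cap) {
        b->cap = b->cap ? 2 * b->cap : 16;
        b->digits = realloc(b->digits, b->cap * sizeof *b->digits);
        assert(b->digits != NULL);
    }
    b->digits[b->len++] = d;
}

/* Multiply in place by a small non-negative integer k. */
static void bignum_mul_small(bignum *b, int k) {
    long carry = 0;
    for (size_t i = 0; i < b->len; i++) {
        long t = (long)b->digits[i] * k + carry;
        b->digits[i] = (int)(t % BASE);
        carry = t / BASE;
    }
    while (carry > 0) {
        bignum_push(b, (int)(carry % BASE));
        carry /= BASE;
    }
}

/* n! by repeated small multiplication. Caller frees result.digits. */
bignum factorial(int n) {
    bignum r = {NULL, 0, 0};
    bignum_push(&r, 1);
    for (int k = 2; k <= n; k++) {
        bignum_mul_small(&r, k);
    }
    return r;
}

/* Decimal rendering. Caller frees the returned string. */
char *bignum_to_string(const bignum *b) {
    char *s = malloc(4 * b->len + 1);
    assert(s != NULL);
    /* The top digit is printed without zero padding. */
    int n = sprintf(s, "%d", b->digits[b->len - 1]);
    for (size_t i = b->len - 1; i-- > 0;) {
        n += sprintf(s + n, "%04d", b->digits[i]);
    }
    return s;
}
```

Calling factorial(5000) and then bignum_to_string on the result yields
the decimal expansion of 5000!, which is the computation the benchmark
times.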

Michel.



* Re: [Caml-list] Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18  8:04 Performance of threaded interpreter on hyper-threaded CPU Michel Schinz
@ 2006-04-18  8:33 ` Xavier Leroy
  2006-04-18 10:27   ` Michel Schinz
  2006-04-18  8:34 ` [Caml-list] " Christophe TROESTLER
  1 sibling, 1 reply; 14+ messages in thread
From: Xavier Leroy @ 2006-04-18  8:33 UTC (permalink / raw)
  To: Michel Schinz; +Cc: caml-list

 > When the ratio given in the last column is greater than 1, then
 > threaded code is faster than the switch-based solution. As you can
 > see, this is only true in my case for non-hyper-threaded
 > architectures.

Which version(s) of gcc do you use for compiling the bytecode
interpreter?  Is it the same version on all machines?

The reason I'm asking is that some versions of gcc are known to
generate poor code for threaded interpreters, e.g. gcc 3.2 generates
inferior code compared to gcc 2.95.  For more info, google for
"Anton Ertl gcc".

- Xavier Leroy



* Re: [Caml-list] Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18  8:04 Performance of threaded interpreter on hyper-threaded CPU Michel Schinz
  2006-04-18  8:33 ` [Caml-list] " Xavier Leroy
@ 2006-04-18  8:34 ` Christophe TROESTLER
  2006-04-18  8:46   ` Jonathan Roewen
  1 sibling, 1 reply; 14+ messages in thread
From: Christophe TROESTLER @ 2006-04-18  8:34 UTC (permalink / raw)
  To: OCaml Mailing List

On Tue, 18 Apr 2006, Michel Schinz <Michel.Schinz@epfl.ch> wrote:
> 
> To get the two versions of the OCaml interpreter, I uncompress
> ocaml-3.09.1 in two separate directories, and in one of them I simply

Try again with 3.09.2
http://caml.inria.fr/pub/ml-archives/caml-list/2006/04/93ffa37a3ac448d428344bac0297229e.en.html

From the message:

  Bug fixes:
  - runtime: inefficiency of signal handling PR#3990

(This greatly influences threaded code.  See the discussion
http://caml.inria.fr/pub/ml-archives/caml-list/2006/03/5eb1b92704b88c1af8f5cba1f623b36a.en.html)

ChriS



* Re: [Caml-list] Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18  8:34 ` [Caml-list] " Christophe TROESTLER
@ 2006-04-18  8:46   ` Jonathan Roewen
  2006-04-18  8:57     ` Christophe TROESTLER
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Roewen @ 2006-04-18  8:46 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: OCaml Mailing List

> From the message:
>
>  Bug fixes:
>  - runtime: inefficiency of signal handling PR#3990
>
> (This greatly influences threaded code.  See the discussion
> http://caml.inria.fr/pub/ml-archives/caml-list/2006/03/5eb1b92704b88c1af8f5cba1f623b36a.en.html)

This is not threaded as in multi-threading, so I don't believe it
affects the results between switched and threaded.

Jonathan



* Re: [Caml-list] Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18  8:46   ` Jonathan Roewen
@ 2006-04-18  8:57     ` Christophe TROESTLER
  0 siblings, 0 replies; 14+ messages in thread
From: Christophe TROESTLER @ 2006-04-18  8:57 UTC (permalink / raw)
  To: caml-list

On Tue, 18 Apr 2006, "Jonathan Roewen" <jonathan.roewen@gmail.com> wrote:
> 
> This is not threaded as in multi-threading, so I don't believe it
> affects the results between switched and threaded.

You are probably correct -- I skimmed through the message too fast!
Sorry about the noise.

ChriS



* Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18  8:33 ` [Caml-list] " Xavier Leroy
@ 2006-04-18 10:27   ` Michel Schinz
  2006-04-18 11:40     ` [Caml-list] " Till Varoquaux
                       ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Michel Schinz @ 2006-04-18 10:27 UTC (permalink / raw)
  To: caml-list

Xavier Leroy <Xavier.Leroy@inria.fr> writes:

>  > When the ratio given in the last column is greater than 1, then
>  > threaded code is faster than the switch-based solution. As you can
>  > see, this is only true in my case for non-hyper-threaded
>  > architectures.
>
> Which version(s) of gcc do you use for compiling the bytecode
> interpreter?  Is it the same version on all machines?

No, unfortunately not. Here are the various versions used (I realise
this variety is annoying, but I have no control over what software
runs on these machines):

1.25 GHz PPC G4
  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1
   (Apple Computer, Inc. build 5247)
1.70 GHz P4
  gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
3.0 GHz hyper-threaded P4
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
dual 3.0 GHz hyper-threaded Xeon
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

I'm aware of the problem due to gcc's cross-jumping "optimisation"
(described, as you mention, by Ertl in [1]). For the record, I tried
disabling it with -fno-crossjumping, but as Ertl mentions, this didn't
change anything. However, judging by the versions of gcc I'm using,
cross-jumping should also be performed on the second machine, for
which threaded code provides a noticeable gain...

However, your remark motivated me to measure the performance of a
single ocamlrun executable running on the various Pentium 4 machines
I have at hand, and the results are interesting...

Using the executable produced by gcc 3.2.2, I obtain the following
timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

while using the executable produced by gcc 3.4.4, I obtain the
following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

Finally, I noticed that gcc 4.0.0 was also available on the second
machine, so I gave it a try, and obtained the following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

So the threaded code version of the OCaml VM is always slower on the
hyper-threaded P4, albeit not always by the same amount.

Michel.

[1] http://www.complang.tuwien.ac.at/forth/threading/



* Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 10:27   ` Michel Schinz
@ 2006-04-18 11:40     ` Till Varoquaux
  2006-04-18 11:59       ` Michel Schinz
  2006-04-18 12:56     ` Stefan Monnier
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Till Varoquaux @ 2006-04-18 11:40 UTC (permalink / raw)
  To: Michel Schinz; +Cc: caml-list


I might just add that hyperthreading is pretty far from a real
multiprocessor setup...
It works best when the various threads are using different units of the
CPU, which is less likely to happen when the threads are basically doing
the same thing. A friend of mine has been experimenting on Xeons
recently: with the exact same code (i.e. multithreaded in both cases), he
gains 12.5% when using hyperthreading. This might be an extreme example...
However, supposing you were in the same case, it is very conceivable that
the few percent you scrape by are lost in the machinery required to get
multithreading working properly (mutexes etc.).
Could you try running your multithreaded code on only one of the virtual
CPUs to see the improvement hyperthreading really brings in?
Till


* Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 11:40     ` [Caml-list] " Till Varoquaux
@ 2006-04-18 11:59       ` Michel Schinz
  0 siblings, 0 replies; 14+ messages in thread
From: Michel Schinz @ 2006-04-18 11:59 UTC (permalink / raw)
  To: caml-list

"Till Varoquaux" <till.varoquaux@gmail.com> writes:

[...]

> Could you try running your multithreaded code on only one of the
> virtual CPUs to see the improvement hyperthreading really brings in?

To clarify things: I'm not talking about a multi-threaded program.

"Threaded code" is a technique which is commonly used to speed up
dispatching in interpreters. It is relatively well described on the
following page:

http://www.complang.tuwien.ac.at/forth/threaded-code.html

The OCaml VM is written in such a way that it uses that technique if
possible (basically if it is compiled using a recent gcc, which offers
the extensions needed to implement threaded code), and falls back to a
"standard" switch-based dispatching technique otherwise.

My observation is that in some circumstances, threaded code seems to
slow down the VM instead of speeding it up as it should. (The biggest
slowdown is observed on a hyper-threaded architecture, but I have
no idea why.)

Michel.



* Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 10:27   ` Michel Schinz
  2006-04-18 11:40     ` [Caml-list] " Till Varoquaux
@ 2006-04-18 12:56     ` Stefan Monnier
  2006-04-18 16:18     ` [Caml-list] " Xavier Leroy
  2006-04-25 22:52     ` [Caml-list] " Joaquin Cuenca Abela
  3 siblings, 0 replies; 14+ messages in thread
From: Stefan Monnier @ 2006-04-18 12:56 UTC (permalink / raw)
  To: caml-list

> | 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

> | 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

> | 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
> | dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

"Xeon" and "Pentium 4" are marketing names that refer to two different
packagings of basically the same set of processors (depending on whether
they are targeted at servers or at desktops/laptops).  Worse, the set of
processors covered by each name is actually pretty large, with some
significant differences in their internal pipelines (some specifically
targeted at making hyperthreading suck less).

Have you tried to turn HT off and run your tests again?

What OS was used?  Were there other processes active at the same time?

Maybe people on comp.arch can solve this puzzle.


        Stefan



* Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 10:27   ` Michel Schinz
  2006-04-18 11:40     ` [Caml-list] " Till Varoquaux
  2006-04-18 12:56     ` Stefan Monnier
@ 2006-04-18 16:18     ` Xavier Leroy
  2006-04-18 16:42       ` gang chen
  2006-04-19  8:24       ` Michel Schinz
  2006-04-25 22:52     ` [Caml-list] " Joaquin Cuenca Abela
  3 siblings, 2 replies; 14+ messages in thread
From: Xavier Leroy @ 2006-04-18 16:18 UTC (permalink / raw)
  To: Michel Schinz; +Cc: caml-list

 > However, your remark motivated me to measure the performance of a
 > single ocamlrun executable running on the various Pentium 4 machines
 > I have at hand, and the results are interesting...

Random thoughts:

The performance variations between the gcc versions confirm my
impression that gcc is getting "too clever for its own good" --
carefully hand-optimized code like the OCaml bytecode interpreter
is best served by a compiler that compiles code nearly as written.
(Think gcc 2.95.)

The P4 microarchitecture is known for its weird performance model:
some code runs very fast, some similar code very slow.
In my experience, AMD processors as well as the Pentium-M are
much more consistent performance-wise.

If you really want to understand what's going on, you need a good
performance analysis tool.  Timing runs will tell you nothing.
Intel's VTUNE is king of the hill here, but the Windows version is
costly and I could never install the free Linux version.

- Xavier Leroy



* Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 16:18     ` [Caml-list] " Xavier Leroy
@ 2006-04-18 16:42       ` gang chen
  2006-04-19  8:24       ` Michel Schinz
  1 sibling, 0 replies; 14+ messages in thread
From: gang chen @ 2006-04-18 16:42 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list

Does OCaml have benchmark programs that C compiler writers and
architecture developers can use for performance measurement?

Gang Chen


* Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 16:18     ` [Caml-list] " Xavier Leroy
  2006-04-18 16:42       ` gang chen
@ 2006-04-19  8:24       ` Michel Schinz
  1 sibling, 0 replies; 14+ messages in thread
From: Michel Schinz @ 2006-04-19  8:24 UTC (permalink / raw)
  To: caml-list

Xavier Leroy <Xavier.Leroy@inria.fr> writes:

[...]

> The P4 microarchitecture is known for its weird performance model:
> some code runs very fast, some similar code very slow.
> In my experience, AMD processors as well as the Pentium-M are
> much more consistent performance-wise.

Ok, I wasn't aware of this. I don't have access to machines using AMD
or Pentium-M processors, unfortunately.

Anyway, the timing of the OCaml VM was just a side experiment for me:
I wanted to see whether the speedup provided by threaded code in the
OCaml VM was similar to the one it provides in a VM we develop for a
course. Now that I've answered that question, I think I'll leave it at
that.

Thank you for your suggestion about VTUNE, I'll think about it if I
ever need a detailed analysis tool.

Michel.



* Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-18 10:27   ` Michel Schinz
                       ` (2 preceding siblings ...)
  2006-04-18 16:18     ` [Caml-list] " Xavier Leroy
@ 2006-04-25 22:52     ` Joaquin Cuenca Abela
  2006-04-27 11:42       ` Joaquin Cuenca Abela
  3 siblings, 1 reply; 14+ messages in thread
From: Joaquin Cuenca Abela @ 2006-04-25 22:52 UTC (permalink / raw)
  To: caml-list

Michel wrote:

> I'm aware of the problem due to gcc's cross-jumping "optimisation"
> (described, as you mention, by Ertl in [1]). For the record, I tried
> disabling it with -fno-crossjumping, but as Ertl mentions, this didn't
> change anything. However, judging by the versions of gcc I'm using,
> cross-jumping should also be performed on the second machine, for
> which threaded code provides a noticeable gain...

Hi,

FWIW, I did some tests with gcc 3.4.2, and -fno-crossjumping works
*sometimes*. If you combine it with -O2 *and* your virtual machine has
fewer than 10 opcodes (??) then -fno-crossjumping works as expected.

Otherwise (i.e., with any real virtual machine) gcc generates an extra
jump, but you don't get exactly the same assembly as with a switched
VM. The assembly for the switched VM puts more instructions in the
extra basic block than the assembly for the threaded VM.

In the threaded VM, the extra block is only (in my test) a jmp *%eax.

Cheers,

--
Joaquin Cuenca Abela




* Re: [Caml-list] Re: Performance of threaded interpreter on hyper-threaded CPU
  2006-04-25 22:52     ` [Caml-list] " Joaquin Cuenca Abela
@ 2006-04-27 11:42       ` Joaquin Cuenca Abela
  0 siblings, 0 replies; 14+ messages in thread
From: Joaquin Cuenca Abela @ 2006-04-27 11:42 UTC (permalink / raw)
  To: caml-list

Joaquin wrote:
>
> FWIW, I did some tests with gcc 3.4.2, and -fno-crossjumping
> works *sometimes*. If you combine it with -O2 *and* your virtual
> machine has fewer than 10 opcodes (??) then -fno-crossjumping
> works as expected.

And replying to myself, this problem seems to be fixed in gcc 4.0.0.

See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242

-- 

Joaquin Cuenca Abela


