Performance of threaded interpreter on hyper-threaded CPU

* Performance of threaded interpreter on hyper-threaded CPU
@ 2006-04-18  8:04 Michel Schinz
  2006-04-18  8:33 ` [Caml-list] " Xavier Leroy
  2006-04-18  8:34 ` [Caml-list] " Christophe TROESTLER
  0 siblings, 2 replies; 14+ messages in thread
From: Michel Schinz @ 2006-04-18  8:04 UTC (permalink / raw)
  To: caml-list

Hi,

In order to get an idea of the speedup provided by a threaded code
interpreter, I've been comparing the performance of the switch-based
OCaml interpreter with the one which uses threaded code.

On some architectures, threaded code provides a speedup which matches
my expectations (around 20%), but on a hyper-threaded Pentium IV, I
actually get a massive slowdown: the threaded code interpreter is more
than two times slower than the switch-based one! I just wanted to
present my results here, since it seems that threaded code might not
always be the fastest option. I'd also be interested in knowing why
threaded code is so much slower in some cases, provided of course that
my results are not flawed.

My testing methodology and result are described below.

To get the two versions of the OCaml interpreter, I uncompress
ocaml-3.09.1 in two separate directories, and in one of them I simply
delete the line which defines THREADED_CODE in byterun/config.h. I
also add the following lines at the very beginning of ocaml_interprete
in byterun/interp.c (in both directories):

#ifdef THREADED_CODE
  fprintf(stderr, "threaded code\n");
#else
  fprintf(stderr, "switch-based\n");
#endif

They enable me to be sure that the version which is running is the one
I expect.

Once this is done, I compile the two versions by first launching
configure with a different prefix for both directories, and then
letting "make world install" do its job. When this is complete, I have
two versions of the interpreter, one using threaded code, the other
using a big switch, and my measurements can start.

To perform my measures, I use a small program which computes the
factorial of 5000 using a naive implementation of big integers
(represented as lists of "digits" in base 10000). I compile this
program once then run it with the switch-based interpreter, and then
with the threaded code one. I run the benchmark five times in a row,
and select the lowest time, as given by the "time" command. The
results are summarised in the following table. When the ratio given in
the last column is greater than 1, then threaded code is faster than
the switch-based solution. As you can see, this is only true in my
case for non-hyper-threaded architectures. Concerning the OS, the
first machine runs OS X 10.4.6, while the other ones run various
versions of Linux.

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.25 GHzPower PC G4               |   9.04 |     7.24 |  1.2486 |
| 1.70 GHz Pentium 4                |   6.36 |     4.81 |  1.3222 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.13 | 0.40946 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.59 | 0.92479 |

I also measured the time taken by "make world" on the third machine,
and the results confirm that the threaded code interpreter is slower
than the switch-based one. Here are the timings:

  switch-based : 89.53s user, 12.73s system
  threaded code: 114.77s user, 13.03 system
  ratio (sw/th): 0.78

I will gladly provide more information about the various systems used
for testing if anyone is interested.

The small benchmark program I'm using is available there:

http://lamp.epfl.ch/~schinz/bignums.ml

Michel.

^ permalink raw reply	[flat|nested] 14+ messages in thread