caml-list - the Caml user's mailing list
* [Caml-list] novice puzzled by speed tests
@ 2004-01-03 12:57 dolfi
  2004-01-04 16:49 ` Xavier Leroy
  0 siblings, 1 reply; 4+ messages in thread
From: dolfi @ 2004-01-03 12:57 UTC (permalink / raw)
  To: caml-list


Hi everybody,

I'm new to the list, and deeply impressed by the stability and speed of
OCaml, as well as its functional nature. My best congratulations to the
authors!

Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake
9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt.
The program in question is a prime generator:

let count = 400000

(* Table of the first [count] primes, seeded with 2, 3 and 5. *)
let ptab =
  let ptab = Array.make count 0 in
  ptab.(0) <- 2; ptab.(1) <- 3; ptab.(2) <- 5;
  ptab

(* Trial division by the primes found so far; candidates step by
   +2, +4, +2, ... so multiples of 2 and 3 are never tried. *)
let rec loop nr toggle psize =
  if psize < count then begin
    let max = truncate (sqrt (float nr)) in
    let rec floop i =
      if ptab.(i) > max then begin
        ptab.(psize) <- nr;                       (* nr is prime *)
        loop (nr + toggle) (6 - toggle) (succ psize)
      end else if nr mod ptab.(i) = 0 then
        loop (nr + toggle) (6 - toggle) psize     (* nr is composite *)
      else
        floop (succ i)
    in
    floop 2
  end

let () =
  loop 7 4 3;   (* start after the seeds, as in the C version below *)
  Printf.printf "prime %d: %d\n" count ptab.(pred count)

On my box, the corresponding C program (gcc -O3) is slightly slower than
the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the
-unsafe one:

#include <stdio.h>
#include <math.h>

#define how_many 400000

int main()
{   unsigned int nr = 5, toggle = 2, max, primes_size = 3, i;
    unsigned int primes[how_many];          /* prime table (~1.6 MB on the stack) */

    primes[0] = 2; primes[1] = 3; primes[2] = 5;
  loop:
    nr += toggle; toggle = 6 - toggle;      /* candidates step +2, +4, +2, ... */
    max = sqrt(nr);
    for (i = 2; primes[i] <= max; ++i)      /* trial division up to sqrt(nr) */
        if (!(nr % primes[i])) goto loop;   /* composite: next candidate */
    primes[primes_size++] = nr;             /* prime: record it */
    if (primes_size < how_many) goto loop;

    printf("%u\n", primes[how_many - 1]);
    return 0;
}

Of course it's good that range checking increases the speed of a program,
but, being a long-time C user, I'm a little puzzled by miracles like this.
I suspected that the sense of the -unsafe flag was inverted, but it isn't:
when I deliberately introduce an out-of-range access, the -unsafe program
dies with SEGV, while the safe one raises an exception.
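
For illustration, such a probe can be as small as this (not the exact test
I ran; the exception message may also vary between OCaml releases):

let () =
  let t = Array.make 10 0 in
  (* Compiled normally, this raises Invalid_argument ("index out of
     bounds").  Compiled with -unsafe, the access is not checked at all,
     so the wild read typically crashes the program or returns garbage. *)
  Printf.printf "%d\n" t.(1_000_000_000)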

Till soon, Dolfi




-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] novice puzzled by speed tests
  2004-01-03 12:57 [Caml-list] novice puzzled by speed tests dolfi
@ 2004-01-04 16:49 ` Xavier Leroy
  2004-01-04 20:49   ` Brian Hurt
  2004-01-05 19:50   ` [Caml-list] camlp4 Ker Lutyn
  0 siblings, 2 replies; 4+ messages in thread
From: Xavier Leroy @ 2004-01-04 16:49 UTC (permalink / raw)
  To: dolfi; +Cc: caml-list

> Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake
> 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt.
> [...]
> On my box, the corresponding C program (gcc -O3) is slightly slower than
> the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the
> -unsafe one:
> [...]
> Of course it's good that range checking increases the speed of a program,
> but, being a long-time C user, I'm a little puzzled by miracles like this.
> I suspected that the sense of the -unsafe flag was inverted, but it isn't:
> when I deliberately introduce an out-of-range access, the -unsafe program
> dies with SEGV, while the safe one raises an exception.

Welcome to the wonderful world of modern processors.  It's not
uncommon to observe "absurd" speed differences of up to 20%.  By
"absurd" I mean for instance adding or removing dead code (never
executed) and observing a slowdown, or executing more code and
observing a speedup.

As far as I can guess, this is due to two processor features:

- Lots of instruction-level parallelism is available.  Thus, if your
main code doesn't use all of the computational units, adding extra code
(such as array bound checks) that can execute in parallel doesn't
reduce execution speed.

- Performance is very sensitive to code placement.  Things like code
cache conflicts, (mis-) alignment of branch targets, and oddities in the
instruction decoding logic can cause insertion *or deletion* of a few
instructions to have significant impact on execution speed.

These are just wild guesses.  The bottom line is that processors have
become so complex that explaining observed performances (let alone
predicting performances) has become nearly impossible, at least for
small performance variations (say, less than a factor of 1.5).  

(This makes compiler writers very unhappy, because they used to make a
living by cranking out 5% speed improvements, which are now lost in
the overall noise :-)

If you have access to a good performance analysis tool, such as
Intel's VTune, you could run it on your three executables and see if
some rational explanation comes out of VTune's figures.  But I
wouldn't bet on it.

- Xavier Leroy




* Re: [Caml-list] novice puzzled by speed tests
  2004-01-04 16:49 ` Xavier Leroy
@ 2004-01-04 20:49   ` Brian Hurt
  2004-01-05 19:50   ` [Caml-list] camlp4 Ker Lutyn
  1 sibling, 0 replies; 4+ messages in thread
From: Brian Hurt @ 2004-01-04 20:49 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: dolfi, caml-list

On Sun, 4 Jan 2004, Xavier Leroy wrote:

> > Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake
> > 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt.
> > [...]
> > On my box, the corresponding C program (gcc -O3) is slightly slower than
> > the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the
> > -unsafe one:
> > [...]
> > Of course it's good that range checking increases the speed of a program,
> > but, being a long-time C user, I'm a little puzzled by miracles like this.
> > I suspected that the sense of the -unsafe flag was inverted, but it isn't:
> > when I deliberately introduce an out-of-range access, the -unsafe program
> > dies with SEGV, while the safe one raises an exception.
> 
> Welcome to the wonderful world of modern processors.  It's not
> uncommon to observe "absurd" speed differences of up to 20%.  By
> "absurd" I mean for instance adding or removing dead code (never
> executed) and observing a slowdown, or executing more code and
> observing a speedup.
> 
> As far as I can guess, this is due to two processor features:
> 
> - Lots of instruction-level parallelism is available.  Thus, if your
> main code doesn't use all of the computational units, adding extra code
> (such as array bound checks) that can execute in parallel doesn't
> reduce execution speed.

Modern processors also do fairly aggressive branch prediction.  Bounds
checking doesn't *normally* fail, so the processor very quickly starts
predicting the branch with a very high degree of accuracy.  A correctly
predicted branch is not exactly free, but very nearly so.  A mispredicted
branch is expensive: dozens of clock cycles.

Hmm.  Actually, as I understand it, all array accesses in OCaml go through
the same code, and that code does the bounds checking.  This means that an
OCaml program has essentially one branch instruction doing bounds checking.
Thinking about it, on modern processors this may be better than hoisting
the checks out to the individual access sites, even in situations where
hoisting could eliminate most of the bounds checks.  The reason is that a
processor only keeps ("caches") prediction information for a limited number
of branches, generally about 4K of them IIRC.  So program A doesn't hoist
its bounds checking, and every access reuses the exact same branch: bounds
checking takes up only one entry in the branch history cache, but that
branch is executed a lot.  Program B does hoist its bounds checking, so now
there are, say, a thousand different branch instructions doing bounds
checks.  Suddenly 25% of the branch history cache is devoted to bounds
checking branches, all of which should be trivially predictable, but which
push other branches out of the history.  Boom, the code runs slower.
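
To make the two shapes concrete, here is a caricature in OCaml source
(ocamlopt does not literally generate either of these; it is only meant to
illustrate the difference):

(* Shared shape: every access funnels through the same checking routine,
   so bounds checking occupies a single conditional branch. *)
let checked_get a i =
  if i < 0 || i >= Array.length a then invalid_arg "index out of bounds";
  Array.unsafe_get a i

(* Per-site shape: each access carries its own check, so each access site
   contributes its own conditional branch to the program. *)
let sum_first_two a =
  (if Array.length a < 1 then invalid_arg "index out of bounds");
  let x = Array.unsafe_get a 0 in
  (if Array.length a < 2 then invalid_arg "index out of bounds");
  x + Array.unsafe_get a 1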

This isn't what this programmer is hitting.  But it demonstrates that 
performance optimization is trickier than it might seem.

> 
> - Performance is very sensitive to code placement.  Things like code
> cache conflicts, (mis-) alignment of branch targets, and oddities in the
> instruction decoding logic can cause insertion *or deletion* of a few
> instructions to have significant impact on execution speed.

Cache conflicts especially.  Modern processors have latencies of 100-350
or more clock cycles to go to main memory.  Very expensive: to the point
where I'd argue that the run time of a program is determined mainly by the
number of cache misses it takes, not by the number of instructions it
executes.  If I had to guess, he's hitting some weird cache behavior.
Instruction decode weirdness, branch target misalignment, and the like
generally cost only dozens of clock cycles and don't change the performance
of code that much; it's cache effects that can cause surprisingly big
swings.

You can have poor cache behavior even with a working set much smaller than
the cache, because of the way caches are organized.  Say you have a 2-way
set-associative cache of size C.  If you're heavily accessing locations X
and X+C, you're okay, but if you're heavily accessing locations X, X+C,
and X+2C, you're screwed: the cache can only hold two of those lines at
once, so accessing the third pushes one of the other two out.  Then you
make a "minor" change, and suddenly you're accessing locations X, X+C, and
X+2C+L (where L is the cache line size, generally 16-64 bytes), and boom,
all three lines fit in the cache at the same time.  Suddenly your code is
100 times faster.  This can happen with both data and code; I've seen it
happen in real code.
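
If you want to look for this effect, a rough probe looks something like
the following (the cache and line sizes here are made-up placeholders,
and whether the effect shows up depends on the actual machine):

let () =
  let cache_bytes = 512 * 1024 in          (* assumed cache size C *)
  let line_bytes = 64 in                   (* assumed line size L *)
  let c = cache_bytes / 8 in               (* C in float-array elements *)
  let l = line_bytes / 8 in                (* L in float-array elements *)
  let a = Array.make (4 * c) 0.0 in
  let acc = [| 0.0 |] in
  let bench label x y z =
    let t0 = Sys.time () in
    for _i = 1 to 50_000_000 do
      acc.(0) <- acc.(0) +. a.(x) +. a.(y) +. a.(z)
    done;
    Printf.printf "%-16s %.2fs\n" label (Sys.time () -. t0)
  in
  bench "X, X+C, X+2C" 0 c (2 * c);        (* can thrash a 2-way set *)
  bench "X, X+C, X+2C+L" 0 c (2 * c + l)   (* third line lands in another set *)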

Here's an interesting paper, for two reasons: first, it shows just how
expensive minor changes can get due to cache effects, and second, it
proposes a solution (skewed-associative caches) that I'd like to see
implemented, although it has a NIH problem:
http://citeseer.nj.nec.com/bodin94cache.html

Thinking about it, I'd love to see OCaml do, as its final linking stage,
a quick depth-first traversal of the known call graph, emitting the
functions in that order.  I'd bet this would do good things for OCaml's
performance.
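
The ordering pass itself would be tiny; something like this sketch (the
graph representation and the names are invented for illustration, and a
real implementation would work on compiled units rather than strings):

(* Lay out functions in depth-first order over the statically known call
   graph, so that callers and callees tend to end up near each other. *)
let dfs_layout calls roots =
  let callees f = try List.assoc f calls with Not_found -> [] in
  let rec visit (seen, order) f =
    if List.mem f seen then (seen, order)
    else List.fold_left visit (f :: seen, f :: order) (callees f)
  in
  let _, order = List.fold_left visit ([], []) roots in
  List.rev order

let () =
  let g = [ "main", ["parse"; "solve"];
            "solve", ["step"; "report"];
            "parse", ["lex"] ] in
  (* Prints main, parse, lex, solve, step, report, one per line. *)
  List.iter print_endline (dfs_layout g ["main"])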

> 
> These are just wild guesses.  The bottom line is that processors have
> become so complex that explaining observed performances (let alone
> predicting performances) has become nearly impossible, at least for
> small performance variations (say, less than a factor of 1.5).  

Easily.  I've seen performance swings of 100x due to simple changes.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian




* [Caml-list] camlp4
  2004-01-04 16:49 ` Xavier Leroy
  2004-01-04 20:49   ` Brian Hurt
@ 2004-01-05 19:50   ` Ker Lutyn
  1 sibling, 0 replies; 4+ messages in thread
From: Ker Lutyn @ 2004-01-05 19:50 UTC (permalink / raw)
  To: caml-list

Happy New Year!

I just tried to get camlp4 working from cvs:

- configure looks for exact equality between the ocaml version and the version
in camlp4/pcaml.ml

- camlp4/argl.ml does not handle Arg.Tuple and Arg.Bool
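
For reference, here is roughly what those two spec constructors look like
in use (the flag names are invented):

let verbose = ref false
let size = ref (0, 0)

let spec =
  [ "-verbose", Arg.Bool (fun b -> verbose := b), " <bool>  be chatty";
    "-size",
      Arg.Tuple [ Arg.Int (fun w -> size := (w, snd !size));
                  Arg.Int (fun h -> size := (fst !size, h)) ],
      " <w> <h>  set a size" ]

let () = Arg.parse spec (fun _ -> ()) "usage: demo [options]"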

Since I wanted camlp4 for a quick one-off and I am under deadline, I can't look
into it more deeply. Let me just add my voice to those others who think it is
too bad that camlp4 has fallen off the map.

Just FYI, here's a cool camlp4 link:

http://www.venge.net/graydon/talks/mkc/html/index.html






Thread overview: 4+ messages
2004-01-03 12:57 [Caml-list] novice puzzled by speed tests dolfi
2004-01-04 16:49 ` Xavier Leroy
2004-01-04 20:49   ` Brian Hurt
2004-01-05 19:50   ` [Caml-list] camlp4 Ker Lutyn