Date: Sun, 4 Jan 2004 14:49:02 -0600 (CST)
From: Brian Hurt
To: Xavier Leroy
cc: dolfi@zkm.de,
Subject: Re: [Caml-list] novice puzzled by speed tests
In-Reply-To: <20040104174920.A20953@pauillac.inria.fr>

On Sun, 4 Jan 2004, Xavier Leroy wrote:

> > Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake
> > 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt.
> > On my box, the corresponding C program (gcc -O3) is slightly slower than
> > the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the
> > -unsafe one:
> > Of course it's good that range checking increases the speed of programs,
> > but, being a long-time C user, I'm a little bit puzzled by miracles like
> > this. I suspected that the sense of the -unsafe flag was inverted, but it
> > isn't: the -unsafe program dies with SEGV when I deliberately introduce a
> > range overflow, the safe one gets an exception.
>
> Welcome to the wonderful world of modern processors. It's not
> uncommon to observe "absurd" speed differences of up to 20%. By
> "absurd" I mean for instance adding or removing dead code (never
> executed) and observing a slowdown, or executing more code and
> observing a speedup.
>
> As far as I can guess, this is due to two processor features:
>
> - Lots of instruction-level parallelism is available. Thus, if your
> main code doesn't use all of the computational units, adding extra code
> (such as array bound checks) that can execute in parallel doesn't
> reduce execution speed.

Modern processors also do fairly aggressive branch prediction. A bounds
check doesn't *normally* fail, so the processor very quickly starts
predicting its branch with a very high degree of accuracy. A correctly
predicted branch isn't quite free, but it's very nearly so; a mispredicted
branch is expensive, costing dozens of clock cycles.

Hmm. Actually, as I understand it, all array accesses call the same code
in OCaml, and that code then does the bounds checking. This means that
every OCaml program has only a single branch instruction doing bounds
checking. Thinking about it, on modern processors this may be better
than hoisting the bounds checks, even in situations where hoisting could
eliminate most of them.
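To make the comparison concrete, here is a minimal sketch of the kind of
loop involved (my own toy program, not the original poster's benchmark).
Compiling with -unsafe simply drops the bounds check on a.(i):

    (* bench.ml: sum an array, timing bounds-checked access.
       Build both ways and compare the timings:
         ocamlopt -o bench bench.ml
         ocamlopt -unsafe -o bench_unsafe bench.ml *)

    let a = Array.init 1_000_000 (fun i -> i)

    let sum a =
      let s = ref 0 in
      for i = 0 to Array.length a - 1 do
        s := !s + a.(i)          (* bounds-checked, unless -unsafe *)
      done;
      !s

    let () =
      let t0 = Sys.time () in
      let total = ref 0 in
      for _i = 1 to 1_000 do total := sum a done;
      Printf.printf "sum = %d  time = %.3fs\n" !total (Sys.time () -. t0)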
This is because a processor only keeps ("caches") prediction information
for so many branches, generally about 4K of them IIRC. So say program A
doesn't hoist its bounds checks, and every check goes through the exact
same branch. Only one entry in the branch history cache is used for
bounds checking, but that branch gets executed a lot. Program B does
hoist its bounds checks, so now there are a thousand different branch
instructions in the program doing the checking. Now 25% of the branch
history cache is taken up by bounds-checking branches, all of which
should be trivially predictable, but which push other branches out of
the history. Boom, the code runs slower.

That isn't what this programmer is hitting, but it does demonstrate that
performance optimization is trickier than it might seem.

> - Performance is very sensitive to code placement. Things like code
> cache conflicts, (mis-) alignment of branch targets, and oddities in the
> instruction decoding logic can cause insertion *or deletion* of a few
> instructions to have significant impact on execution speed.

Cache conflicts especially. Modern processors have latencies of 100-350
or more clock cycles to main memory. That's expensive enough that I'd
argue the run time of a program is determined mainly by the number of
cache misses it takes, not by the number of instructions it executes. If
I had to guess, he's hitting some weird cache behavior. Instruction
decode weirdness, branch target alignment, and the like generally cost
only dozens of clock cycles and don't change performance that much; it's
cache effects that can make surprisingly big swings.

You can have poor cache behavior even with a working set much smaller
than the cache, due to the way caches work. Say you have a 2-way
set-associative cache of size C. If you're heavily accessing locations X
and X+C, you're okay; but if you're heavily accessing locations X, X+C,
and X+2C, you're screwed. All three map to the same set, and the cache
can hold only two of them at once, so accessing the third pushes one of
the original two out. You make a "minor" change, and suddenly you're
accessing locations X, X+C, and X+2C+L (where L is the cache line size,
generally 16-64 bytes), and boom, all three lines fit in the cache now.
And suddenly your code is 100 times faster. This can happen with both
data and code; I've seen it happen in real code. (There's a toy
demonstration of the effect at the end of this message.)

Here's an interesting paper, for two reasons: first, it shows just how
expensive "minor" changes can get due to cache effects, and second, it
proposes a solution (skewed-associative caches) that I'd like to see
implemented, but which has a NIH problem:

    http://citeseer.nj.nec.com/bodin94cache.html

Thinking about it, I'd love to see OCaml do, as its final linking stage,
a quick depth-first traversal sort of the known-call graph of the
functions, so that callers land near their callees (a sketch is at the
end of this message). I'd bet this would do good things for OCaml's
performance.

> These are just wild guesses. The bottom line is that processors have
> become so complex that explaining observed performances (let alone
> predicting performances) has become nearly impossible, at least for
> small performance variations (say, less than a factor of 1.5).

Easily. I've seen performance swings of 100x due to simple changes.
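To make the X, X+C, X+2C scenario concrete, here is a toy program. The
cache parameters are pure assumptions (adjust c and l for the machine at
hand), the index arithmetic assumes 8-byte array elements, and on a
cache with more than two ways this particular pattern won't thrash:

    (* conflict.ml: three hot locations at stride C thrash a 2-way
       set-associative cache of size C bytes; offsetting the third
       location by one cache line L resolves the conflict. *)

    let c = 512 * 1024                 (* assumed cache size, bytes *)
    let l = 64                         (* assumed line size, bytes *)
    let word bytes = bytes / 8         (* byte offset -> int-array index *)

    let a = Array.make (word (2 * c + l) + 1) 1

    (* Repeatedly touch three locations at the given byte offsets. *)
    let hammer o1 o2 o3 =
      let s = ref 0 in
      for _i = 1 to 10_000_000 do
        s := !s + a.(word o1) + a.(word o2) + a.(word o3)
      done;
      !s

    let time name f =
      let t0 = Sys.time () in
      ignore (f ());
      Printf.printf "%s: %.3fs\n" name (Sys.time () -. t0)

    let () =
      time "X, X+C, X+2C  " (fun () -> hammer 0 c (2 * c));
      time "X, X+C, X+2C+L" (fun () -> hammer 0 c (2 * c + l))

And here is the sort of link-time pass I have in mind, assuming the
linker already has the call graph in hand (dfs_layout and the
string-keyed graph representation are made up for the sketch): emit
functions in depth-first preorder from the entry points, so each caller
sits next to the first things it calls.

    module SMap = Map.Make (String)

    (* callees maps each function name to the functions it calls;
       roots are the entry points.  Returns a code-layout order. *)
    let dfs_layout (callees : string list SMap.t) (roots : string list) =
      let seen = Hashtbl.create 64 in
      let order = ref [] in
      let rec visit f =
        if not (Hashtbl.mem seen f) then begin
          Hashtbl.add seen f ();
          order := f :: !order;        (* place f before its callees *)
          List.iter visit (try SMap.find f callees with Not_found -> [])
        end
      in
      List.iter visit roots;
      List.rev !order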
--
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                           - Gene Spafford

Brian