Date: Mon, 8 Dec 2003 11:29:08 -0600 (CST)
From: Brian Hurt
To: Abdulaziz Ghuloum
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Object-oriented access bottleneck
In-Reply-To: <3FD3BCB2.3090506@cs.indiana.edu>

On Sun, 7 Dec 2003, Abdulaziz Ghuloum wrote:

> Brian Hurt wrote:
>
> >I actually question the value of inlining as a performance improvement,
> >unless it leads to other significant optimizations. Function calls simply
> >aren't that expensive anymore, on today's OOO super-scalar
> >speculative-execution CPUs. A direct call, i.e. one not through a
> >function pointer, I benchmarked out at about 1.5 clocks on an AMD K6-3.
> >Probably less on a more advanced CPU. Indirect calls, i.e. through a
> >function pointer, are slower only due to the load-to-use penalty.
> >If the pointer is in L1 cache, an indirect call is probably only 3-8
> >clocks.
> >
> >Cache misses are the big cost. Hitting L1 cache, the cheapest memory
> >access, is generally 2-4 clocks. L2 cache is generally 6-30 clocks.
> >Missing cache entirely and having to go to main memory is 100-300+ clocks.
> >Inlining expands the code size, and thus means you're likely having more
> >expensive cache misses. At 300 clocks/cache miss, it doesn't take all
> >that many cache misses to totally overwhelm the small advantages gained
> >by inlining functions.
>
> Hello,
>
> Do you happen to have a pointer to a document listing the (approximate)
> timing of the various instructions on today's hardware? You have listed
> a few and I was wondering if you have a more comprehensive study.

Not really, because the biggest cost (accessing main memory) depends too
much on the specific system. For example, switching from DDR233 to DDR333
will greatly reduce the latency of accessing main memory. Which
Northbridge chipset you're using can also change things. The pattern of
the memory accesses matters too: on the P4, for example, accessing memory
in a linear fashion that the CPU can predict allows the CPU to preload
cache lines, lowering the cost of a cache miss to ~100 clocks. But if
you're accessing things randomly, or in a way the CPU can't predict, a
cache miss is ~300 clocks. Also, since I'm measuring everything in clocks,
changing the clock rate of your CPU changes the measurement.

I get this information from a couple of years of reading Ace's Hardware
(http://www.aceshardware.com/). Google the site for "memory latency".

> You say "Inlining expands the code size and thus you're likely having
> more expensive cache misses". I wonder how true that is. For example,
> consider a simple expression such as {if p() e0 e1}.
> If the compiler decides to inline p (in case p is small enough, leaf
> node, etc ...), then in addition to the benefits of inlining (no need
> to save your live variables, or restore them later, constant folding,
> copy propagation, ...), you're also avoiding the jump to p. Since p can
> be anywhere in memory, a cache miss is very probable. If p was inlined,
> its location is now close and is less likely to cause a cache miss. Not
> inlining causes the PC to be all over the place, causing more cache
> misses. Am I missing something?

Yes:

A) For inlining p not to increase code size, p has to be very small,
really trivial. On the x86 (32-bit) a direct call is 5 bytes worth of
instruction, so p() needs to be at most 2-3 small instructions after
inlining to fit. Going through a virtual function table increases the
size of the call, but not by huge amounts. The only place where you are
likely to win on a regular basis is accessor functions. IIRC, Ocaml
allows you to have member variables be public. If accessor functions are
a problem, consider using public variables.

B) Most of the time your code is executing in a loop. So the first time
through the loop you load all your code into cache, and then execute from
cache from there on out. The only problem is when the total size of the
code is larger than cache. Not inlining code makes the total code size of
the loop smaller (generally), and even if the non-inlined version of the
code still doesn't fit into cache, it often performs better. Imagine a
whole bunch of straight-line code that doesn't fit into cache, but calls
p() multiple times. If there is only a single (non-inlined) copy of p(),
then p() is likely to remain in cache because it keeps being used. If
there are instead multiple different copies of p(), executing one copy of
p() is likely to push another copy of p() out of cache.
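
A rough way to find out whether accessor calls matter in a particular
program is to time them. The sketch below is only an illustration, not a
careful benchmark: the names counter, counter_rec, time and iterations are
made up for the example, and Sys.time reports CPU seconds rather than
clocks. One caveat on point A: instance variables of an OCaml object can't
be read from outside the object, so the closest thing to a "public
variable" used here is a record with a mutable field.

```ocaml
(* Illustrative micro-benchmark; every name below is hypothetical. *)

(* Object whose state is only reachable through an accessor method. *)
class counter (init : int) = object
  val n = init
  method get = n
end

(* Record with a directly readable mutable field, standing in for a
   "public variable". *)
type counter_rec = { mutable value : int }

(* Run [f] once; return its result and the CPU seconds it took. *)
let time f =
  let t0 = Sys.time () in
  let r = f () in
  (r, Sys.time () -. t0)

let iterations = 10_000_000

(* Read the value [iterations] times through a method call. *)
let sum_via_method () =
  let c = new counter 1 in
  let acc = ref 0 in
  for _i = 1 to iterations do
    acc := !acc + c#get
  done;
  !acc

(* Read the value [iterations] times through a plain field access. *)
let sum_via_field () =
  let c = { value = 1 } in
  let acc = ref 0 in
  for _i = 1 to iterations do
    acc := !acc + c.value
  done;
  !acc

let () =
  let (a, t_method) = time sum_via_method in
  let (b, t_field) = time sum_via_field in
  (* Both loops must compute the same sum, or the comparison is moot. *)
  assert (a = b);
  Printf.printf "method: %.3fs  field: %.3fs\n" t_method t_field
```

Numbers from a loop like this are fragile (compiler version, flags, and
what else is in cache all move them), so treat the output as a hint about
where to look, and measure the real program before changing its design.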

C) More programming sins are committed in the name of performance than
any other reason, including stupidity. Get the code working first, then
benchmark, then profile. That will tell you where your performance
problems are. Then go back and improve your algorithms.

We are down to worrying about clock cycles here (even hundreds of clock
cycles), the point at which you start wondering if the code should be
written in C or assembler. I love Ocaml not because it's efficient on the
clock cycle level, but because it makes it easier for me to see and work
with the high level stuff: algorithms and design issues. Which is where
the big savings are to be found.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford
Brian

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners