Date: Sun, 20 Oct 2002 11:42:51 +0200
From: Xavier Leroy
To: Paul Stodghill
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C

> Looking at the assembly produced by O'Caml and GCC, it appears that GCC
> is performing loop unrolling (as requested with -funroll-loops) and
> strength reduction in the inner loops. I can easily see why these two
> optimizations can result in such a tremendous performance difference.
>
> My question is this: I can obviously perform loop unrolling myself
> by hand - does ocamlopt perform strength reduction?

No, it doesn't.  It doesn't hoist loop-invariant computations either.
(See http://caml.inria.fr/ocaml/numerical.html)

Multiplications are pretty expensive on modern processors, so your code
would run much faster if you hoisted the computation i*120 out of the
j and k loops, and strength-reduced k*120 in the innermost loop.

Another option worth considering is to swap the j and k loops, thus
making k*120 and a.(i*120+k) two invariants of the inner loop.  In this
case, you might not even have to strength-reduce k*120.

Other loop optimizations such as loop unrolling would (I guess) make
much less of a difference -- modern processors are very good at branch
prediction, making loop overhead low.

Some more thoughts: Bagley's code works on a "proper" matrix (an array
of arrays), while yours flattens the matrix into a single array.  There
are pros and cons to either approach, but notice that your approach
(without strength reduction) entails fewer loads but more
multiplications, which can be more expensive...

Tiling loops is good for really large matrices, but yours (in this
example) occupies "only" about 115 KB and thus fits comfortably in the
L2 cache.  Perhaps tiling isn't beneficial in this case.

- Xavier Leroy
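
For concreteness, here is a minimal OCaml sketch of the transformations
suggested above -- hoisting i*120 and swapping the j and k loops so that
k*120 and a.(i*120+k) become invariants of the innermost loop.  It
assumes 120x120 matrices stored row-major in flat float arrays; the
names n, matmul, a, b and c are illustrative, not taken from the
original post (whose code is not shown in this message).

(* Illustrative sketch only; assumes row-major flat float arrays. *)
let n = 120

let matmul a b c =
  for i = 0 to n - 1 do
    let i_n = i * n in                  (* i*120 hoisted out of the j and k loops *)
    for k = 0 to n - 1 do
      let k_n = k * n in                (* k*120: invariant of the inner j loop *)
      let a_ik = a.(i_n + k) in         (* a.(i*120+k): invariant of the inner j loop *)
      for j = 0 to n - 1 do
        (* c[i][j] <- c[i][j] + a[i][k] * b[k][j] *)
        c.(i_n + j) <- c.(i_n + j) +. a_ik *. b.(k_n + j)
      done
    done
  done

let () =
  (* Small usage example: c must start out zeroed. *)
  let a = Array.make (n * n) 1.0
  and b = Array.make (n * n) 2.0
  and c = Array.make (n * n) 0.0 in
  matmul a b c;
  Printf.printf "c.(0) = %g\n" c.(0)

With the i, k, j ordering the innermost loop also walks b and c with
unit stride, which is friendly to the cache.  Bounds checks can
additionally be avoided with Array.unsafe_get/Array.unsafe_set or by
compiling with ocamlopt -unsafe, at the cost of safety.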