Date: Sun, 20 Oct 2002 11:42:51 +0200
From: Xavier Leroy
To: Paul Stodghill
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C

> Looking at the assembly produced by O'Caml and GCC, it appears that GCC
> is performing loop unrolling (as requested with -funroll-loops) and
> strength reduction in the inner loops. I can easily see why these two
> optimizations can result in such a tremendous performance difference.
>
> My question is this: I can obviously perform loop unrolling myself
> by hand - does ocamlopt perform strength reduction?

No, it doesn't.  It doesn't hoist loop-invariant computations either.
(See http://caml.inria.fr/ocaml/numerical.html)

Multiplications are pretty expensive on modern processors, so your code
would run much faster if you hoisted the computation i*120 out of the
j and k loops, and strength-reduced k*120 in the innermost loop.

Another option worth considering is to swap the j and k loops, thus
making k*120 and a.(i*120+k) two invariants of the inner loop.  In this
case, you might not even have to strength-reduce k*120.

Other loop optimizations such as loop unrolling would (I guess) make
much less of a difference -- modern processors are very good at branch
prediction, making loop overhead low.

Some more thoughts: Bagley's code works on a "proper" matrix (an array
of arrays), while yours flattens the matrix into a single array.  There
are pros and cons to either approach, but notice that your approach
(without strength reduction) entails fewer loads but more
multiplications, which can be more expensive...

Tiling loops is good for really large matrices, but yours (in this
example) occupies "only" about 115 KB and thus fits comfortably in the
L2 cache.  Perhaps tiling isn't beneficial in this case.

- Xavier Leroy
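
For concreteness, here is a minimal OCaml sketch of the transformations
suggested above -- hoisting i*120 and swapping the j and k loops so that
k*120 and a.(i*120+k) become invariants of the innermost loop.  It
assumes 120x120 matrices stored row-major in flat float arrays; the
names n, matmul, a, b and c are illustrative, not taken from the
original post (whose code is not shown in this message).

(* Illustrative sketch only; assumes row-major flat float arrays. *)
let n = 120

let matmul a b c =
  for i = 0 to n - 1 do
    let i_n = i * n in                  (* i*120 hoisted out of the j and k loops *)
    for k = 0 to n - 1 do
      let k_n = k * n in                (* k*120: invariant of the inner j loop *)
      let a_ik = a.(i_n + k) in         (* a.(i*120+k): invariant of the inner j loop *)
      for j = 0 to n - 1 do
        (* c[i][j] <- c[i][j] + a[i][k] * b[k][j] *)
        c.(i_n + j) <- c.(i_n + j) +. a_ik *. b.(k_n + j)
      done
    done
  done

let () =
  (* Small usage example: c must start out zeroed. *)
  let a = Array.make (n * n) 1.0
  and b = Array.make (n * n) 2.0
  and c = Array.make (n * n) 0.0 in
  matmul a b c;
  Printf.printf "c.(0) = %g\n" c.(0)

With the i, k, j ordering the innermost loop also walks b and c with
unit stride, which is friendly to the cache.  Bounds checks can
additionally be avoided with Array.unsafe_get/Array.unsafe_set or by
compiling with ocamlopt -unsafe, at the cost of safety.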