From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id UAA01974; Fri, 13 Jun 2003 20:38:53 +0200 (MET DST)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id UAA02334 for <caml-list@pauillac.inria.fr>; Fri, 13 Jun 2003 20:38:52 +0200 (MET DST)
Received: from walnut.he.net (walnut.he.net [64.71.137.114])
	by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id h5DIcoH19655
	for <caml-list@inria.fr>; Fri, 13 Jun 2003 20:38:50 +0200 (MET DST)
Received: from localhost (kmacy@localhost) by walnut.he.net (8.8.6p2003-03-31/8.8.2) with ESMTP id LAA23180; Fri, 13 Jun 2003 11:38:51 -0700
Date: Fri, 13 Jun 2003 11:38:51 -0700 (PDT)
From: Kip Macy <kmacy@fsmware.com>
X-Sender: kmacy@walnut.he.net
To: David McClain <dmcclain1@mindspring.com>
cc: caml-list@inria.fr
Subject: Re: [Caml-list] FP's and HyperThreading Processors
In-Reply-To: <003601c33177$324ecc40$0201a8c0@dylan>
Message-ID: <Pine.LNX.4.21.0306131117030.5668-100000@walnut.he.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Spam: no; 0.00; kip:99 caml-list:01 unrolling:01 latency:01 kicks:01 tlb:99 allocating:01 allocator:01 locality:01 garbage:01 writes:01 ops:01 350:97 optimized:02 deeper:02 
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk


> along with a multithreaded vendor supplied FFT routine (presumably optimized
> for their processor).
If it was optimized for the P2 it will by definition not be optimized for
the P4, being potentially penalized by a much deeper pipeline and the use
of a trace cache instead of a standard I-cache. For example loop unrolling
is *bad* when you have a limited number of pre-decoded ops. Writes to the
D-cache write 64 bytes, reads bring in a "sector" or 2 cache lines to try
and mask the increased latency of the memory bus. The hardware pre-fetcher
kicks in after you access 256 bytes sequentially. What this all translates
to is that perfectly healthy data access patterns on the P2 may be
pathological on the P4. And it may in part be due to the FFT.  Little if
any of this applies if you already have an appropriate version of the 
FFT for the P4.
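
One rough way to see whether access pattern matters on a given machine is
to time the same sum done sequentially and with a stride. This is only a
sketch: sum_strided and time_sum are names made up for illustration, and a
stride of 8 doubles is chosen only because it matches the 64-byte figure
above. It says nothing about what the vendor FFT actually does.

```c
#include <stddef.h>
#include <time.h>

/* Sum all n elements of a, visiting them with the given stride (in elements).
   stride == 1 walks memory sequentially; a larger stride makes `stride`
   passes over the array, each pass jumping ahead by `stride` elements. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double sum = 0.0;
    if (stride == 0)
        stride = 1;
    for (size_t offset = 0; offset < stride; offset++)
        for (size_t i = offset; i < n; i += stride)
            sum += a[i];
    return sum;
}

/* CPU seconds taken by one call. The sum is stored through *out so the
   compiler cannot discard the work being timed. */
double time_sum(const double *a, size_t n, size_t stride, double *out)
{
    clock_t t0 = clock();
    *out = sum_strided(a, n, stride);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_sum(a, n, 1, &s) and time_sum(a, n, 8, &s) on an array well
beyond the cache size, on both machines, would give a first hint; the
performance counters are still the real answer.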

It is also worth noting that with the small L1 cache sizes on the P4,
hyperthreading running data-intensive programs could easily end up being a
net loss, with competing processes kicking out each other's cache entries.

As a side note you could end up being partly TLB limited if your access
patterns jump around. If you are running a more recent version of Linux
you might want to try putting your data on 4MB pages.
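
As a sketch of what that can look like: on current Linux kernels one way
to ask for large pages is mmap with MAP_HUGETLB. That flag is newer than
this message, so take it as an assumption about your kernel rather than
the mechanism meant above. alloc_maybe_huge is a made-up helper name, the
huge page size depends on the architecture and kernel configuration, and
the request fails unless huge pages have been reserved, hence the fallback
to ordinary pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `bytes` of zeroed anonymous memory, trying
   huge pages first. `bytes` should be a multiple of the huge page size.
   Sets *got_huge to 1 if the huge-page mapping succeeded, 0 otherwise.
   Returns NULL if even the ordinary mapping fails.
   Release the memory with munmap(p, bytes). */
void *alloc_maybe_huge(size_t bytes, int *got_huge)
{
    void *p;

    *got_huge = 0;
#ifdef MAP_HUGETLB
    /* Assumes a kernel and libc that know MAP_HUGETLB. */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *got_huge = 1;
        return p;
    }
#endif
    /* Fallback: ordinary pages. */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Whether this helps depends on how much of the time is really going to TLB
misses, which again is something to check with the performance counters.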
 
> net result is that this program runs only twice as fast on the new 3 GHz P4
> as it runs on the old 350 MHz P2.
> 

I suspect your analysis is correct, but I'd really have to try out the
performance counters before I came to any conclusions. This doesn't
necessarily mean that ML is intrinsically on the wrong track with
allocating new memory. It does mean that more work needs to be
done to make the memory allocator and garbage collector more
locality-aware. There is some discussion of this in "Compiling with 
Continuations" by Appel.


					-Kip
 
