Date: Fri, 13 Jun 2003 11:38:51 -0700 (PDT)
From: Kip Macy
To: David McClain
cc: caml-list@inria.fr
Subject: Re: [Caml-list] FP's and HyperThreading Processors
In-Reply-To: <003601c33177$324ecc40$0201a8c0@dylan>

> along with a multithreaded vendor supplied FFT routine (presumably optimized
> for their processor).

If it was optimized for the P2, it will by definition not be optimized for
the P4, where it may be penalized by a much deeper pipeline and by the use
of a trace cache instead of a standard I-cache. For example, loop unrolling
is *bad* when you have a limited number of pre-decoded ops. Writes to the
D-cache write 64 bytes; reads bring in a "sector", or 2 cache lines, to try
to mask the increased latency of the memory bus. The hardware pre-fetcher
kicks in after you access 256 bytes sequentially.

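To make the point about access patterns concrete, here is a minimal sketch in C that counts how many distinct cache lines a strided read touches. The function name lines_touched is hypothetical, and the 64-byte line size used in the example figures is an assumption borrowed from the write size quoted above. The model is deliberately crude: it ignores associativity, the two-line read sectors, and the prefetcher, so it shows only a trend and is not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched by n_accesses reads, where
 * read i starts at byte offset i * stride_bytes from a line-aligned base.
 * Simplifying assumption: each read fits inside a single line.
 * Offsets never decrease, so a new line is touched exactly when the
 * line index differs from the previous one. */
size_t lines_touched(size_t n_accesses, size_t stride_bytes, size_t line_bytes)
{
    assert(line_bytes > 0);

    if (n_accesses == 0) {
        return 0;
    }

    size_t count = 1;      /* the first read always touches line 0 */
    size_t prev_line = 0;

    for (size_t i = 1; i < n_accesses; i++) {
        size_t line = (i * stride_bytes) / line_bytes;
        if (line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

Under these assumptions, lines_touched(1024, 8, 64) gives 128, while lines_touched(1024, 64, 64) gives 1024: the same number of 8-byte reads pulls in eight times as many lines once the stride reaches the line size, which is the sort of pattern the large strides in an FFT can produce.
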
What this all translates to is that perfectly healthy data access patterns
on the P2 may be pathological on the P4. And the disappointing speedup may
in part be due to the FFT. Little if any of this applies if you already have
an appropriate version of the FFT for the P4.

It is also worth noting that, with the small L1 cache on the P4, running
data-intensive programs under hyperthreading could easily end up being a net
loss, with competing processes kicking out each other's cache entries.

As a side note, you could end up being partly TLB limited if your access
patterns jump around. If you are running a more recent version of Linux, you
might want to try putting your data on 4MB pages.

> net result is that this program runs only twice as fast on the new 3 GHz P4
> as it runs on the old 350 MHz P2.
>

I suspect your analysis is correct, but I'd really have to try out the
performance counters before I came to any conclusions.

This doesn't necessarily mean that ML is intrinsically on the wrong track
with allocating new memory. It does mean that more work needs to be done to
make the memory allocator and garbage collector more locality aware. There
is some discussion of this in "Compiling with Continuations" by Appel.

-Kip