Date: Wed, 2 Sep 1998 19:23:56 +0200
From: Xavier Leroy
To: Todd Graham Lewis, caml-list@inria.fr
Subject: Re: VLIW & caml: how?
Message-ID: <19980902192356.36292@pauillac.inria.fr>
In-Reply-To: ; from Todd Graham Lewis on Fri, Aug 28, 1998 at 01:18:34AM -0400

> I've been reading that VLIW as implemented on the IA-64/Merced will pose
> problems for conventional compilers such as gcc which don't have a very
> expansive view of the code they're compiling. How well will o'caml deal
> with optimizing for this sort of architecture? Any thoughts?

It's hard to say anything precise until Intel releases detailed
documentation on the IA64 instruction set.

If your question is about instruction-level parallelism (ILP) in
general, note that today's superscalar architectures (such as the
Alpha 21264 and the PowerPC 604) already offer more parallelism
(i.e. 4 instructions issued per cycle) than most compiled programs can
exploit. This is due in part to insufficient optimizations in
compilers (extracting ILP from sequential code might require
significant program transformations) and in part to the fact that many
programs simply do not contain enough parallelism, by the nature of the
algorithms they use.
Often, the only way to fully exploit the resources of those superscalar
processors is to write carefully tuned assembly code by hand...

Code generated by ocamlopt has characteristics similar to the so-called
"commercial workload" subset of Spec95, i.e. a high number of memory
accesses, low to medium ILP, and relatively low CPI. This is not
surprising, as hardware manufacturers generally increase ILP by adding
more integer and floating-point ALUs, which are not useful for most
Caml applications, but don't increase the number of load-store units,
which would be good for Caml but is very hard to implement in hardware.

However, there is some hope that the clean semantics of Caml might
allow more aggressive scheduling of memory accesses than is possible
with e.g. C programs. In particular, the type system gives a lot of
non-aliasing properties "for free" (e.g. a load from an immutable data
structure cannot interfere with a non-initializing store). See my
PLDI'98 tutorial for more details (http://pauillac.inria.fr/~xleroy/).
But again, this can be useful only if the hardware supports many
pending memory accesses simultaneously.

All in all, I'm not expecting much of a speedup from ILP. The important
speedups we've observed on Caml programs when moving from older
architectures (e.g. the Alpha 21064) to newer ones (e.g. the Alpha
21164 or PowerPC G3) are due to better caches and faster memory
subsystems much more than to increased on-chip parallelism.

- Xavier Leroy
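
The non-aliasing remark above (a load from an immutable structure cannot
interfere with a non-initializing store) can be illustrated with a small
OCaml sketch. The types and function names below are hypothetical and
chosen only for illustration; the sketch shows what the type system
guarantees at the source level, and makes no claim about what ocamlopt
actually does with that information.

```ocaml
(* Hypothetical types, for illustration only. *)
type point = { x : int; y : int }        (* immutable record *)
type counter = { mutable hits : int }    (* record with a mutable field *)

(* The update of c.hits is a non-initializing store into a mutable
   field. p.x is a field of an immutable record, so no such store can
   change it: both loads of p.x yield the same value, and a scheduler
   would be free to merge them or move them across the store. *)
let sum_x_twice (p : point) (c : counter) : int =
  let a = p.x in
  c.hits <- c.hits + 1;
  let b = p.x in            (* same value as a, whatever c is *)
  a + b

(* By contrast, two references of the same type may alias, so the
   second load of !r cannot be assumed equal to the first. *)
let bump_and_read (r : int ref) (s : int ref) : int =
  let a = !r in
  s := !s + 1;
  let b = !r in             (* differs from a when r and s alias *)
  a + b
```

Calling bump_and_read with the same reference for both arguments gives a
different result than calling it with two distinct references holding
the same initial value, which is exactly the dependence a scheduler must
respect for mutable data and can ignore for immutable data.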