From: Brian Hurt
Date: Fri, 01 Jun 2007 10:57:48 -0400
To: Stephen Weeks
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Comparison of OCaml and MLton for numerics

Stephen Weeks wrote:

>> However, in some cases, defunctorization may produce a good speedup,
>> especially if you use massive inlining (e.g. ocamlopt -inline 1000).
>> On the contrary, defunctorization may produce cache problems, because
>> the size of the defunctorized code may be much bigger than the size
>> of the initial code.
>
> I've never observed this problem in practice using MLton, and don't
> know anyone in the MLton world who has. Has this actually been
> observed using the OCaml defunctorizer?

Not with the OCaml defunctorizer, but in other contexts I have indeed
seen cases where inlining functions significantly decreased performance
due to cache thrashing. And I know people (my dad) who have seen a
program's size shrink by a factor of 3 from a one-*word* change in the
source code. Short story: a base class in a large C++ program had an
inline virtual destructor, which then had to be inlined everywhere in
the code where an object that inherited from that class was being
freed. Removing the inline keyword significantly increased performance
and radically decreased code size. The code change was opposed because
"inlining functions makes code faster".

Another example I've seen, although it's smaller, is in branch
prediction. CPUs keep track, per branch, of whether branches tend to be
taken or not. That prediction is then used to speculatively execute
code, but the problem is that when a branch is mispredicted, the cost
is large (10-20+ clock cycles; smaller than the 100-350+ clock cycles
of a cache miss, but still significant compared to the cost of a
function call).
CPUs only keep track of a limited number of branches, however. By
inlining, and duplicating, the code, you're putting more pressure on
the branch prediction logic, and causing more branches to be
mispredicted, with the associated cost.

My experience has been that inlining is only a win in three cases:

1) where the function being inlined is so trivial and so small that the
   size and cost of the function call is comparable to the rest of the
   function. Given that a function call to a known location is about 5
   bytes on x86, and the cost of a call the last time I measured it was
   about 1.5 cycles, plus 1-2 cycles per argument, I mean really effin
   small and simple functions here;

2) where the function is only called from one place; or

3) where inlining opens up significant other optimization
   opportunities. The classic example for OCaml here is replacing a
   call to Pervasives.compare with an integer compare.

Most of the rest of the time, inlining is either a break-even
proposition or a loss, which is why I consider Linus Torvalds' "real
programmer" attitude dumb.

In the first two cases, the compiler can easily determine that inlining
is a good idea: estimating the cost or size of a function is easy
enough, and counting the number of places a function is called from is
even easier. And the third case, where inlining opens up new
possibilities for optimization, almost has to be done by the compiler,
as it depends on which optimizations the compiler can, and will, apply
to the newly inlined function. This is something I trust the compiler
to do more than I trust even myself to do correctly.

Brian
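
P.S. To make the Pervasives.compare example in case 3 concrete, here is
a rough sketch (the function names and the toy list are made up purely
for illustration; I haven't benchmarked this exact code): at a
polymorphic type, compare is the generic structural comparison, an
out-of-line call into the runtime, but once the compiler knows (through
inlining or a type annotation) that it is comparing ints, that call can
in principle be specialized to a plain integer comparison.

  (* Polymorphic: [compare] here is the generic structural comparison. *)
  let find_max (l : 'a list) =
    List.fold_left
      (fun acc x -> if compare x acc > 0 then x else acc)
      (List.hd l) (List.tl l)

  (* Monomorphic on int: the compiler can specialize [compare] to a
     direct integer comparison. *)
  let find_max_int (l : int list) =
    List.fold_left
      (fun acc x -> if compare x acc > 0 then x else acc)
      (List.hd l) (List.tl l)

  let () =
    Printf.printf "%d %d\n" (find_max [1; 5; 3]) (find_max_int [1; 5; 3])

Whether a particular version of ocamlopt actually performs that
specialization is worth checking against the generated assembly, but
that's the kind of opportunity I mean: it only appears once inlining
(or defunctorization) has made the types concrete.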