From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 80CC9BC69 for ; Sun, 22 Apr 2007 12:23:29 +0200 (CEST) Received: from smtp6-g19.free.fr (smtp6-g19.free.fr [212.27.42.36]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l3MANTW4025702 for ; Sun, 22 Apr 2007 12:23:29 +0200 Received: from [192.168.1.2] (che78-2-82-237-71-191.fbx.proxad.net [82.237.71.191]) by smtp6-g19.free.fr (Postfix) with ESMTP id B438A6D5F0; Sun, 22 Apr 2007 12:23:28 +0200 (CEST) Message-ID: <462B37A0.4050205@inria.fr> Date: Sun, 22 Apr 2007 12:23:28 +0200 From: Xavier Leroy User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050317) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Markus Mottl Cc: ocaml , yaron jane Subject: Re: [Caml-list] Slow allocations with 64bit code? References: In-Reply-To: X-Enigmail-Version: 0.91.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Miltered: at concorde with ID 462B37A1.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; allocations:01 allocations:01 arrays:01 computations:01 locality:01 model:01 ocaml:01 timings:01 alloc:01 alignment:01 ocamlopt:01 model:01 alloc:01 heap:01 slower:01 > I wonder whether others have already noticed that allocations may > surprisingly be slower on 64bit platforms than on 32bit ones. As already mentioned, on 64-bit platforms almost all Caml data representations are twice as large as on 32-bit platforms (exceptions: strings, float arrays), so the processor has twice as much data to move through its memory subsystem. However, you certainly don't get a slowdown by a factor of 2, for two reasons: 1- the processor doesn't spend all its time doing memory accesses, there are some computations here and there; 2- cache lines are much bigger than 32 bits, meaning that accessing 64 bits at a given address is much cheaper than accessing two 32-bit quantities at two random addresses (spatial locality). Moreover, x86 in 64-bit mode is much more compiler-friendly than in 32-bit mode: twice as many registers, a sensible floating-point model at last. So, OCaml in 64-bit mode generates better code than in 32-bit mode. All in all, your 10% slowdown seems reasonable and in line with what others reported using C benchmarks. > This is only a difference of about 10%, but I have seen more complex > cases where there are timing differences in excess of 50%, which is > already pretty substantial. Be careful with timings: I've seen simple changes in code placement (e.g. introducing or removing dead code) cause performance differences in excess of 20%. It's an unfortunate fact of today's processors that their performance is very hard to predict. > Looking at the assembly, there is really no difference in the loop > other than the use of the quad word instructions, which should not > take longer on the exact same platform (i.e. same CPU-frequency). But > there is a suspicious call to "caml_alloc2", which might cause these > differences. Can it be that there are alignment problems or similar > in the run time? ocamlopt compiles module initialization code in the so-called "compact" model, where code size is reduced by not open-coding some operations such as heap allocation, but instead going through auxiliary functions like "caml_alloc2". This makes sense since initialization code is usually large but not performance-critical. I recommend you put performance-critical code in functions, not in the initialization code. - Xavier Leroy