From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id 42C45BBC1 for ; Wed, 20 Feb 2008 14:37:09 +0100 (CET) X-IronPort-AV: E=Sophos;i="4.25,381,1199660400"; d="scan'208";a="8322098" Received: from macabane.inria.fr ([128.93.8.160]) by mail1-relais-roc.national.inria.fr with ESMTP/TLS/AES128-SHA; 20 Feb 2008 14:37:09 +0100 Cc: John Caml Message-Id: <300C2AED-0D26-438B-A45E-5635043FEE61@inria.fr> From: Damien Doligez To: Caml Mailing List In-Reply-To: <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v919.2) Subject: Re: [Caml-list] Re: large hash tables Date: Wed, 20 Feb 2008 14:37:07 +0100 References: <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com> <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> X-Mailer: Apple Mail (2.919.2) X-Spam: no; 0.00; damien:01 damien:01 hash:01 ocaml:01 ocaml:01 arrays:01 arrays:01 bigarrays:01 bigarrays:01 20,:98 1.2:98 doligez:01 doligez:01 wrote:01 ints:01 On 2008-02-20, at 06:18, John Caml wrote: > However, the memory usage is still pretty bad...it takes nearly an > order of magnitude more memory than the equivalent C++ program. While > the C++ program required 800 MB, my ocaml program requires roughly 6 > GB. Am I doing something very inefficiently? My revised code appears > below. In order to optimize for space, you need to understand how OCaml represents things in memory. > match murList with > | m::u::r::[] -> > let rFloat = float_of_string r > and mInt = int_of_string m > and uInt = int_of_string u in > > let newElement = (uInt, rFloat) > and oldList = movieMajor.(mInt) in > let newList = List.rev_append [newElement] oldList in > Array.set movieMajor mInt newList; Here, you have an array of lists of pairs. The space used by the array is 1 + length + size of contents. The space used by a list is 3 * length + size of the contents. The space used by a pair is 3 + size of contents. The space used by an int is 0 (it's already counted in the container). The space used by a float is a bit tricky: it's normally 3, but float arrays and records of floats are special: it's 1 in this case. All these numbers are in 4-byte words, assuming a 32-bit architecture. On a 64-bit machine, the unit is 8-byte words and floats are size 2 (normally) or 0 (in arrays and records of floats). Here your memory usage will be : 1 + 17770 + 300M + 300M + 200M = 800M, or about 6.4 GB (assuming you have a 64-bit machine). You can do much better by using a tuple of arrays instead of an array of lists of tuples. If you use three arrays (for m, u, and r), read all the data in the arrays, and then do a radix sort (and fill a small indexing array), you'll be down to 2.4 GB of memory. If that's still too much, you could use bigarrays with 32-bit ints and floats, and get down to 1.2 GB. I don't see how you can get 800M in your C program unless your ints are 16-bit, in which case you can do it with bigarrays too. The only problem with all this, is that you'll have to write the sorting code yourself. In conclusion OCaml is not really well-suited to tight memory situations, but usually you can manage. -- Damien