Message-ID: <33d2b3f70802211445q7781d296ka7dd94114b8033b1@mail.gmail.com>
Date: Thu, 21 Feb 2008 14:45:21 -0800
From: "John Caml"
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] large hash tables
In-Reply-To: <1203522851.47bc4d2310d0c@webmail.in-berlin.de>
References: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> <1203522851.47bc4d2310d0c@webmail.in-berlin.de>

The equivalent C++ program uses 874 MB of memory in total. Each of the
1 million records is stored in a vector using 1 single-precision float
and 1 int. Indeed, my machine is AMD64, so OCaml ints are presumably
8 bytes.

I've rewritten my OCaml program again, this time using Bigarray. Its
memory usage is now the same as under C++, so that's good news.
However, my program is quite ugly now, and it's actually more than
twice as long as my C++ program. Any suggestions for simplifying this
program? The way I initialize the "movieMajor" Array seems especially
wonky, but I couldn't figure out a better way.

Thank you all very much.
John

----------------------------

open Bigarray;;

exception SplitError;;
exception ImpossibleError;;

type listPairs = {userList : int32 list; resList : float list};;

type baPairs = {
  userBa : (int32, int32_elt, c_layout) Array1.t;
  resBa : (float, float32_elt, c_layout) Array1.t};;

let loadWholeFile filename =
  let infile = open_in filename in

  let dummyUserBa = Array1.of_array int32 c_layout (Array.of_list []) in
  let dummyResBa = Array1.of_array float32 c_layout (Array.of_list []) in
  let dummyBas = {userBa = dummyUserBa; resBa = dummyResBa} in
  let movieMajor = Array.make 17770 dummyBas in

  let recordWholeMovie mInt curListPairs =
    let userBa = Array1.of_array int32 c_layout
        (Array.of_list curListPairs.userList) in
    let resBa = Array1.of_array float32 c_layout
        (Array.of_list curListPairs.resList) in
    let movieBas = {userBa = userBa; resBa = resBa} in
    movieMajor.(mInt) <- movieBas
  in

  let rec loadLines prevPairs prevMovie count =
    let line = input_line infile in
    let murList = Pcre.split line in

    match murList with
    | m::u::r::[] ->
      let rFloat = float_of_string r
      and mInt = int_of_string m
      and uInt = int_of_string u in

      if (count mod 1000000) = 0 then begin
        Printf.printf "count: %d\n" count;
        flush stdout;
      end;

      let newPairs =
        if mInt = prevMovie then
          {userList = Int32.of_int(uInt) :: prevPairs.userList;
           resList = rFloat :: prevPairs.resList}
        else begin
          recordWholeMovie mInt prevPairs;
          {userList = [Int32.of_int(uInt)]; resList = [rFloat]}
        end
      in
      loadLines newPairs mInt (count + 1)
      (* bug: last movie will not get loaded *)

    | _ -> raise SplitError
  in

  try
    let emptyPairs = {userList = []; resList = []} in
    loadLines emptyPairs (-1) 0
  with
    End_of_file -> close_in infile; movieMajor;;

let filename = Sys.argv.(1);;
loadWholeFile filename;;
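
One way the loader above could perhaps be shortened is sketched here. It is only a sketch under stated assumptions, not a tested replacement. The names load_lines, load_file, n_movies, empty_bas and bas_of_lists are hypothetical. It assumes OCaml 4.07 or later (Bigarray in the standard library, String.split_on_char, "match ... with exception"), fields separated by exactly one space (String.split_on_char stands in for Pcre.split), lines grouped by movie id, and ids in the range 0 to n_movies - 1. A single empty record replaces the three dummy bindings, and the group being accumulated is flushed at end of input, so the last movie is stored too. Each group is stored under the id of the movie it belongs to, whereas the posted code passes mInt, the id just read, to recordWholeMovie.

```ocaml
open Bigarray

type ba_pairs = {
  user_ba : (int32, int32_elt, c_layout) Array1.t;
  res_ba : (float, float32_elt, c_layout) Array1.t;
}

(* Shared zero-length placeholder for movies that have no records. *)
let empty_bas = {
  user_ba = Array1.of_array int32 c_layout [||];
  res_ba = Array1.of_array float32 c_layout [||];
}

(* Turn the lists accumulated for one movie into Bigarrays.
   The lists are built by consing, so they are reversed here to
   restore file order (the posted code keeps them reversed). *)
let bas_of_lists users ratings = {
  user_ba = Array1.of_array int32 c_layout (Array.of_list (List.rev users));
  res_ba = Array1.of_array float32 c_layout (Array.of_list (List.rev ratings));
}

(* Read "movie user rating" lines from [next_line] until it raises
   End_of_file. Assumption: ids lie in [0, n_movies) and lines are
   grouped by movie id. *)
let load_lines ~n_movies next_line =
  let movie_major = Array.make n_movies empty_bas in
  (* Store a finished group; movie = -1 means nothing has been read yet. *)
  let flush movie users ratings =
    if movie >= 0 then movie_major.(movie) <- bas_of_lists users ratings in
  let rec loop movie users ratings =
    match next_line () with
    | exception End_of_file -> flush movie users ratings
    | line ->
      (match String.split_on_char ' ' line with
       | [m; u; r] ->
         let m = int_of_string m
         and u = Int32.of_string u
         and r = float_of_string r in
         if m = movie then loop movie (u :: users) (r :: ratings)
         else begin
           flush movie users ratings;
           loop m [u] [r]
         end
       | _ -> failwith ("malformed line: " ^ line))
  in
  loop (-1) [] [];
  movie_major

(* File-based wrapper around load_lines. *)
let load_file ~n_movies filename =
  let ic = open_in filename in
  let result = load_lines ~n_movies (fun () -> input_line ic) in
  close_in ic;
  result
```

Taking the line source as a function keeps the parsing logic independent of the input channel, so it can be exercised on a small in-memory list of lines before running it on the full data file, e.g. load_file ~n_movies:17770 Sys.argv.(1).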