From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id 6829DBBC1 for ; Wed, 20 Feb 2008 09:37:41 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAPJ1u0fC2fJcnmdsb2JhbACQUgEBAQEBBgQGBwoYnzg X-IronPort-AV: E=Sophos;i="4.25,380,1199660400"; d="scan'208";a="7492342" Received: from anchor-post-34.mail.demon.net ([194.217.242.92]) by mail2-smtp-roc.national.inria.fr with ESMTP; 20 Feb 2008 09:37:41 +0100 Received: from orion.metastack.com ([80.177.38.218]) by anchor-post-34.mail.demon.net with esmtp (Exim 4.67) id 1JRkSS-000K9e-Ep for caml-list@yquem.inria.fr; Wed, 20 Feb 2008 08:37:40 +0000 Received: from countertenor (p54A3E82B.dip.t-dialin.net [84.163.232.43]) (authenticated bits=0) by orion.metastack.com (8.13.4/8.13.3) with ESMTP id m1K8J4UZ015343 (version=TLSv1/SSLv3 cipher=RC4-MD5 bits=128 verify=NO) for ; Wed, 20 Feb 2008 08:19:08 GMT From: "David Allsopp" To: References: <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com> <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> Subject: RE: [Caml-list] Re: large hash tables Date: Wed, 20 Feb 2008 09:37:20 +0100 Organization: MetaStack Solutions Ltd. Message-ID: <00e201c8739b$d7dcb530$2101a8c0@countertenor> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 Thread-Index: AchzfYDaSu1iX8NAT6OLJjVv7pwBtAAHRAlw In-Reply-To: <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 X-Scanned-By: MIMEDefang 2.63 on 172.16.28.218 X-Spam: no; 0.00; hash:01 long-time:01 semantics:01 pervasives:01 ocaml:01 hashtbl:01 stdlib:01 bug:01 ocaml's:01 hash:01 hashtbl:01 integers:01 ocaml:01 long-time:01 infile:01 One further comment on the code below, as you've said that you're a long-time C++ coder is: > if (count mod 1000000) == 0 then begin I think you probably mean to use a single = here: although the semantics for (=) and (==) are the same for ints, I reckon you may have used == as the C/C++ equality operator. The section "Comparisons" in module Pervasives explains the difference between the = and == operators in OCaml. Additionally, there is syntactic sugar for Array.set allowing you write: > Array.set movieMajor mInt newList; as: movieMajor.(mInt) <- newList; IMHO, the Stack_overflow you saw with the Hashtbl implementation in the StdLib (assuming it really is there) is a bug and should be reported - if OCaml's got enough memory left to fill the table, then it shouldn't crash resizing, even it means that the resizing is potentially slower (a slower version could always be used if the buckets are detected to be huge or in response to a caught Stack_overflow - though that could prove tricky). David -----Original Message----- From: caml-list-bounces@yquem.inria.fr [mailto:caml-list-bounces@yquem.inria.fr] On Behalf Of John Caml Sent: 20 February 2008 06:19 To: caml-list@yquem.inria.fr Subject: [Caml-list] Re: large hash tables Thank you all for the assistance. I've resolved the Stack_overflow problem by using an Array instead of a Hashtbl; my keys were just consecutive integers, so this later approach is clearly preferable. However, the memory usage is still pretty bad...it takes nearly an order of magnitude more memory than the equivalent C++ program. While the C++ program required 800 MB, my ocaml program requires roughly 6 GB. Am I doing something very inefficiently? My revised code appears below. Also, if you have any other coding suggestions I'd appreciate hearing them. I'm a long-time coder but new to Ocaml and eager to learn. -------------- exception SplitError let loadWholeFile filename = let infile = open_in filename and movieMajor = Array.make 17770 [] in let rec loadLines count = let line = input_line infile in let murList = Pcre.split line in match murList with | m::u::r::[] -> let rFloat = float_of_string r and mInt = int_of_string m and uInt = int_of_string u in let newElement = (uInt, rFloat) and oldList = movieMajor.(mInt) in let newList = List.rev_append [newElement] oldList in Array.set movieMajor mInt newList; if (count mod 1000000) == 0 then begin Printf.printf "count: %d\n" count; flush stdout; end; loadLines (count + 1) | _ -> raise SplitError in try loadLines 0 with End_of_file -> close_in infile; movieMajor ;; let filename = Sys.argv.(1);; let str = loadWholeFile filename;; _______________________________________________ Caml-list mailing list. Subscription management: http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.inria.fr Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs