Date: Wed, 20 Feb 2008 10:37:59 +0100
From: Berke Durak
To: Francois Rouaix
Cc: John Caml, caml-list@yquem.inria.fr
Subject: Re: [Caml-list] large hash tables

Francois Rouaix wrote:
> In the resizing code there is a non-tailrec function (insert_bucket).
> This is most likely causing the stack overflow, as I can't see any other
> non tail recursive function at first glance. Looks like it's not tail
> rec in order to maintain an invariant on the order of elements.
> If that invariant is not useful to you, you might want to write a slightly
> different version of the Hashtbl module, where insert_bucket would be
> tail rec.
> Also, during resizing, memory usage will be twice the memory required
> for the table (roughly), since the bucket array remains available until
> the resize is completed, so all the bucket contents exist in two
> versions (old and new). You might want to stick to a large initial size
> and do not attempt resizing.

In that case a quick hack could also be to slightly randomize the key, as in

let digest_bytes = 5
let keyify u = (String.sub (Digest.string u) 0 digest_bytes, u)

> let read_whole_chan chan =
>   let movieMajor = Hashtbl.create 777777 in
>
>   let rec loadLines count =
>     let line = input_line chan in
>     let murList = Pcre.split line in
>     match murList with
>       | m::u::r::[] ->
>           let rFloat = float_of_string r in
>           Hashtbl.add movieMajor (keyify m) (u, rFloat);
>           if (count mod 10000) = 0 then Printf.printf "count: %d, m: %s, u: %s, r: %f \n" count m u rFloat;
>           loadLines (count + 1)
>       | _ -> raise SplitError
>   in
>
>   try
>     loadLines 0
>   with
>     End_of_file -> ()
> ;;

-- 
Berke DURAK
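
P.S. For anyone who wants to try the key hack in isolation, here is a minimal sketch that uses only the standard library. It is an illustration under assumptions, not the original loader: load_lines, Split_error and the sample lines are hypothetical names and data, and String.split_on_char on a single space stands in for Pcre.split. It only shows keyify being applied to the key on both insertion and lookup; whether it helps with the stack overflow depends on how the keys are distributed.

```ocaml
(* Number of digest bytes prepended to each key. *)
let digest_bytes = 5

(* Pair a key with the first digest_bytes bytes of its MD5 digest. *)
let keyify u = (String.sub (Digest.string u) 0 digest_bytes, u)

(* Hypothetical exception, raised when a line does not have three fields. *)
exception Split_error of string

(* Hypothetical helper: add each "m u r" line to tbl under keyify m. *)
let load_lines tbl lines =
  List.iter
    (fun line ->
      match String.split_on_char ' ' line with
      | [m; u; r] -> Hashtbl.add tbl (keyify m) (u, float_of_string r)
      | _ -> raise (Split_error line))
    lines

let () =
  let tbl = Hashtbl.create 16 in
  load_lines tbl ["1 alice 4.0"; "1 bob 3.5"; "2 carol 5.0"];
  (* Lookups must go through keyify as well. *)
  Printf.printf "bindings for movie 1: %d\n"
    (List.length (Hashtbl.find_all tbl (keyify "1")))
```

Note that every lookup has to apply keyify too, since the table is now indexed by the (digest prefix, key) pair rather than by the bare key.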