From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 2F7B7BBC1 for ; Wed, 20 Feb 2008 10:56:25 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAMOHu0fBL1AZk2dsb2JhbACMOoQYAQEBAQcEBgkgn0o X-IronPort-AV: E=Sophos;i="4.25,380,1199660400"; d="scan'208";a="22821404" Received: from gw.exalead.com (HELO exalead.com) ([193.47.80.25]) by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 10:56:24 +0100 Received: from [192.168.204.148] (madpc064.exalead.com [192.168.204.148]) (authenticated bits=0) by exalead.com (8.14.0/8.14.0) with ESMTP id m1K9uNQ7031590; Wed, 20 Feb 2008 10:56:23 +0100 Message-ID: <47BBF947.9010306@exalead.com> Date: Wed, 20 Feb 2008 10:56:23 +0100 From: Berke Durak User-Agent: Thunderbird 1.5.0.10 (X11/20070221) MIME-Version: 1.0 To: Berke Durak Cc: Francois Rouaix , caml-list@yquem.inria.fr Subject: Re: [Caml-list] large hash tables References: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> <47BBF4F7.7000101@exalead.com> In-Reply-To: <47BBF4F7.7000101@exalead.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Spam: no; 0.00; berke:01 durak:01 berke:01 durak:01 hash:01 recursive:01 hashtbl:01 hashtbl:01 pcre:01 printf:01 printf:01 hash:01 777777:98 stack:01 rec:01 Berke Durak a écrit : > Francois Rouaix a écrit : >> In the resizing code there is a non-tailrec function (insert_bucket). >> This is most likely causing the stack overflow, as I can't see any >> other non tail recursive function at first glance. Looks like it's not >> tail rec in order to maintain an invariant on the order of elements. >> If that invariant is not useful to you, you might want to write a >> slightly different version of the Hashtbl module, where insert_bucket >> would be tail rec. >> Also, during resizing, memory usage will be twice the memory required >> for the table (roughly), since the bucket array remains available >> until the resize is completed, so all the bucket contents exist in two >> versions (old and new). You might want to stick to a large initial >> size and do not attempt resizing. > > In that casse a quick hack could also be to slightly randomize the key, > as in > > let digest_bytes = 5 > > let keyify u = > (String.substring (Digest.string u) 0 digest_bytes, u) > > > let read_whole_chan chan = > > let movieMajor = Hashtbl.create 777777 in > > > > let rec loadLines count = > > let line = input_line chan in > > let murList = Pcre.split line in > > match murList with > > | m::u::r::[] -> > > let rFloat = float_of_string r in > > Hashtbl.add (keyify movieMajor) m (u, rFloat); > > if (count mod 10000) == 0 then Printf.printf "count: > > %d, m: %s, u: %s, r: %f \n" count m u rFloat; > > loadLines (count + 1) > > | _ -> raise SplitError > > in > > > > try > > loadLines 0 > > with > > End_of_file -> () > > ;; > That was stupid. Actually that's a solution to another problem, as the overflow is probably not due to hash collisions - we are talking of the same keys here. But I have a solution: do not use the Hashtbl to store multiple entries per key; use a Queue per entry for that. Queues include some Obj.magic and are actually pretty cheap. let q = try Hashtbl.find movieMajor m with | Not_found -> let q = Queue.create () in Hashtbl.add movieMajor m q; q in Queue.push (u, rFloat) q -- Berke DURAK