From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.2 required=5.0 tests=AWL,HTML_MESSAGE, MAILTO_TO_SPAM_ADDR,SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 2DBEFBBCB for ; Wed, 20 Feb 2008 00:51:12 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAAb6ukdIDvbyi2dsb2JhbACCPDWNYAEBAQgEBAkKEQWBFpYAhwU X-IronPort-AV: E=Sophos;i="4.25,378,1199660400"; d="scan'208";a="22809220" Received: from ag-out-0708.google.com ([72.14.246.242]) by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 00:51:10 +0100 Received: by ag-out-0708.google.com with SMTP id 31so3455038agc.3 for ; Tue, 19 Feb 2008 15:51:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references; bh=BvYAoUlhQsvbvngHTct8N1GV3QGWqXaRs368050+AaU=; b=gh6wm2GPBH55ZDFigHXiK1+EMks2ViwyzbUxFc1pXO3IAgCPIayQRo3CRjZjWpnD7KMi+hxrwKWsLNI3Vzl1i1+Pm+EB2a1M/V5rTV35maEil2QvIC0b2pPwOpR9MEQetKAaIrSuafQYc9V/nNruQUZSj5u8qXLVuLwlIn4XkNI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references; b=TXj/CEFRgEmtQ1jRyLlKwm0+r6lCzWzLazpW2SMKKmtSsg4WRtU1eqYqfZgZd6OMebZB5n2Bc+HnW0sQKkdZYB3xXgcALmQ8/Y9wLGQcPlnAz1lFzZQ+VGzLqefwmjQmFBsZrmueoS7tkPsSO+viKVvxJ5/msr7Ricy2zbXEClI= Received: by 10.115.79.1 with SMTP id g1mr6701785wal.43.1203465068396; Tue, 19 Feb 2008 15:51:08 -0800 (PST) Received: by 10.114.166.15 with HTTP; Tue, 19 Feb 2008 15:51:08 -0800 (PST) Message-ID: Date: Tue, 19 Feb 2008 15:51:08 -0800 From: "Francois Rouaix" To: "John Caml" Subject: Re: [Caml-list] large hash tables Cc: caml-list@yquem.inria.fr In-Reply-To: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_21553_10406502.1203465068392" References: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> X-Spam: no; 0.00; hash:01 recursive:01 hashtbl:01 --f:01 hash:01 ocaml:01 ocamlopt:01 ocamlc:01 ocaml:01 --john:01 hashtbl:01 pcre:01 printf:01 printf:01 argv:01 ------=_Part_21553_10406502.1203465068392 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline In the resizing code there is a non-tailrec function (insert_bucket). This is most likely causing the stack overflow, as I can't see any other non tail recursive function at first glance. Looks like it's not tail rec in order to maintain an invariant on the order of elements. If that invariant is not useful to you, you might want to write a slightly different version of the Hashtbl module, where insert_bucket would be tail rec. Also, during resizing, memory usage will be twice the memory required for the table (roughly), since the bucket array remains available until the resize is completed, so all the bucket contents exist in two versions (old and new). You might want to stick to a large initial size and do not attempt resizing. --f 2008/2/19 John Caml : > I need to load roughly 100 million items into a memory based hash table, > where each key is an integer and each value is an integer and a float. Under > C++ stdlib's map library, this requires 800 MB of memory. Under my naive > porting of my C++ program to Ocaml, only 12 million rows get loaded before I > get the program crashes with the message "Fatal error: exception > Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use. > (My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient > memory is not the issue.) > > Two questions: > > 1. How can I overcome the Stack_overflow error? (I'm already compiling > with ocamlopt rather than ocamlc, which seems to have increased the stack > size from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my > machine. > 2. How can I implment a more efficient hash table? My Ocaml program seems > to require 10x more memory than under C++. > > My code appears below. Thank you very much. > > --John > > > -------------- > exception SplitError > > let read_whole_chan chan = > let movieMajor = Hashtbl.create 777777 in > > let rec loadLines count = > let line = input_line chan in > let murList = Pcre.split line in > match murList with > | m::u::r::[] -> > let rFloat = float_of_string r in > Hashtbl.add movieMajor m (u, rFloat); > if (count mod 10000) == 0 then Printf.printf "count: %d, > m: %s, u: %s, r: %f \n" count m u rFloat; > loadLines (count + 1) > | _ -> raise SplitError > in > > try > loadLines 0 > with > End_of_file -> () > ;; > > let read_whole_file filename = > let chan = open_in filename in > read_whole_chan chan > ;; > > let filename = Sys.argv.(1);; > > let str = read_whole_file filename;; > > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > ------=_Part_21553_10406502.1203465068392 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline In the resizing code there is a non-tailrec function (insert_bucket). This is most likely causing the stack overflow, as I can't see any other non tail recursive function at first glance. Looks like it's not tail rec in order to maintain an invariant on the order of elements. If that invariant is not useful to you, you might want to write a slightly different version of the Hashtbl module, where insert_bucket would be tail rec.
Also, during resizing, memory usage will be twice the memory required for the table (roughly), since the bucket array remains available until the resize is completed, so all the bucket contents exist in two versions (old and new). You might want to stick to a large initial size and do not attempt resizing.
--f


2008/2/19 John Caml <camljohn42@gmail.com>:
I need to load roughly 100 million items into a memory based hash table, where each key is an integer and each value is an integer and a float. Under C++ stdlib's map library, this requires 800 MB of memory. Under my naive porting of my C++ program to Ocaml, only 12 million rows get loaded before I get the program crashes with the message "Fatal error: exception Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use. (My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient memory is not the issue.)

Two questions:

1. How can I overcome the Stack_overflow error? (I'm already compiling with ocamlopt rather than ocamlc, which seems to have increased the stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my machine.
2. How can I implment a more efficient hash table? My Ocaml program seems to require 10x more memory than under C++.

My code appears below. Thank you very much.

--John


--------------
exception SplitError

let read_whole_chan chan =
    let movieMajor = Hashtbl.create 777777 in

    let rec loadLines count =
        let line = input_line chan in
        let murList = Pcre.split line in
        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r in
                Hashtbl.add movieMajor m (u, rFloat);
                if (count mod 10000) == 0 then Printf.printf "count: %d, m: %s, u: %s, r: %f \n" count m u rFloat;
                loadLines (count + 1)
            | _ -> raise SplitError
  in
       
    try
        loadLines 0
    with
        End_of_file -> ()
    ;;

let read_whole_file filename =
    let chan = open_in filename in
    read_whole_chan chan
    ;;

let filename = Sys.argv.(1);;

let str = read_whole_file filename;;


_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


------=_Part_21553_10406502.1203465068392--