From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.4 required=5.0 tests=HTML_MESSAGE,SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 1BDF9BBC1 for ; Wed, 20 Feb 2008 00:01:58 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAALLtukfRVca+k2dsb2JhbACCPDWNYAEBAQEHBAQLCBaBFpYChxE X-IronPort-AV: E=Sophos;i="4.25,378,1199660400"; d="scan'208";a="22808275" Received: from rv-out-0910.google.com ([209.85.198.190]) by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 00:01:57 +0100 Received: by rv-out-0910.google.com with SMTP id k20so1652009rvb.3 for ; Tue, 19 Feb 2008 15:01:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:mime-version:content-type; bh=E7WuvjjtsXTwy7fm1ge5DRgWfU/gx604MKjl9r9TV2E=; b=io+7dSXNSJmRy5kj9E4Eo1OXsH3f+so895WvcUvNbCVPdcixV/DJBpMI49kM/C8CQa0EfCabcJ+BUSBh6DLR/HoDw71+qBbN0MxPusc4DHAudmPsjuLpUXcrnDRHua/RqCxQwxRsC7P1ycrzMrwcr9PFQBUNNdTPjOqyhT8t+EQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=UibS/VamydtCLM+hCrDk8xXw3mKdzNPeZUQyrRcWbtQsQ1SJMilE/Mf0IwNcXiawuSPKhlReRdoogjmThHZPoFrEbTML4WN4H1B+8zEKq6tDSBsH6DIq6MXfrAZQBSaPyi/mF2wg90+c3wrVEaTz045xhsXOxbxdxgEBdlK1z28= Received: by 10.140.251.1 with SMTP id y1mr5142505rvh.102.1203462106606; Tue, 19 Feb 2008 15:01:46 -0800 (PST) Received: by 10.141.77.9 with HTTP; Tue, 19 Feb 2008 15:01:46 -0800 (PST) Message-ID: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> Date: Tue, 19 Feb 2008 15:01:46 -0800 From: "John Caml" To: caml-list@yquem.inria.fr Subject: large hash tables MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_14188_20983373.1203462106612" X-Spam: no; 0.00; hash:01 hash:01 ocaml:01 ocamlopt:01 ocamlc:01 ocaml:01 --john:01 hashtbl:01 pcre:01 hashtbl:01 printf:01 printf:01 argv:01 ocamlopt:01 ocamlc:01 ------=_Part_14188_20983373.1203462106612 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline I need to load roughly 100 million items into a memory based hash table, where each key is an integer and each value is an integer and a float. Under C++ stdlib's map library, this requires 800 MB of memory. Under my naive porting of my C++ program to Ocaml, only 12 million rows get loaded before I get the program crashes with the message "Fatal error: exception Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use. (My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient memory is not the issue.) Two questions: 1. How can I overcome the Stack_overflow error? (I'm already compiling with ocamlopt rather than ocamlc, which seems to have increased the stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my machine. 2. How can I implment a more efficient hash table? My Ocaml program seems to require 10x more memory than under C++. My code appears below. Thank you very much. --John -------------- exception SplitError let read_whole_chan chan = let movieMajor = Hashtbl.create 777777 in let rec loadLines count = let line = input_line chan in let murList = Pcre.split line in match murList with | m::u::r::[] -> let rFloat = float_of_string r in Hashtbl.add movieMajor m (u, rFloat); if (count mod 10000) == 0 then Printf.printf "count: %d, m: %s, u: %s, r: %f \n" count m u rFloat; loadLines (count + 1) | _ -> raise SplitError in try loadLines 0 with End_of_file -> () ;; let read_whole_file filename = let chan = open_in filename in read_whole_chan chan ;; let filename = Sys.argv.(1);; let str = read_whole_file filename;; ------=_Part_14188_20983373.1203462106612 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline I need to load roughly 100 million items into a memory based hash table, where each key is an integer and each value is an integer and a float. Under C++ stdlib's map library, this requires 800 MB of memory. Under my naive porting of my C++ program to Ocaml, only 12 million rows get loaded before I get the program crashes with the message "Fatal error: exception Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use. (My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient memory is not the issue.)

Two questions:

1. How can I overcome the Stack_overflow error? (I'm already compiling with ocamlopt rather than ocamlc, which seems to have increased the stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my machine.
2. How can I implment a more efficient hash table? My Ocaml program seems to require 10x more memory than under C++.

My code appears below. Thank you very much.

--John


--------------
exception SplitError

let read_whole_chan chan =
    let movieMajor = Hashtbl.create 777777 in

    let rec loadLines count =
        let line = input_line chan in
        let murList = Pcre.split line in
        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r in
                Hashtbl.add movieMajor m (u, rFloat);
                if (count mod 10000) == 0 then Printf.printf "count: %d, m: %s, u: %s, r: %f \n" count m u rFloat;
                loadLines (count + 1)
            | _ -> raise SplitError
  in
       
    try
        loadLines 0
    with
        End_of_file -> ()
    ;;

let read_whole_file filename =
    let chan = open_in filename in
    read_whole_chan chan
    ;;

let filename = Sys.argv.(1);;

let str = read_whole_file filename;;

------=_Part_14188_20983373.1203462106612--