From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 80CB5BBC1 for ; Fri, 22 Feb 2008 01:33:21 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAJSmvUdQRFuwh2dsb2JhbACQXwEBAQgKKYEUnzY X-IronPort-AV: E=Sophos;i="4.25,388,1199660400"; d="scan'208";a="9448728" Received: from furbychan.cocan.org ([80.68.91.176]) by mail3-smtp-sop.national.inria.fr with ESMTP; 22 Feb 2008 01:33:21 +0100 Received: from rich by furbychan.cocan.org with local (Exim 4.63) (envelope-from ) id 1JSLql-00029p-AH; Fri, 22 Feb 2008 00:33:15 +0000 Date: Fri, 22 Feb 2008 00:33:15 +0000 To: John Caml Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] large hash tables Message-ID: <20080222003315.GA5326@annexia.org> References: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> <1203522851.47bc4d2310d0c@webmail.in-berlin.de> <33d2b3f70802211445q7781d296ka7dd94114b8033b1@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <33d2b3f70802211445q7781d296ka7dd94114b8033b1@mail.gmail.com> User-Agent: Mutt/1.5.13 (2006-08-11) From: Richard Jones X-Spam: no; 0.00; hash:01 bigarray:01 chunks:01 caml-list:01 data:02 data:02 idiomatic:02 module:03 module:03 overhead:04 tmp:05 tmp:05 interface:06 allocate:06 million:93 Mine version's a bit longer than your version, but hopefully more idiomatic and easier to understand. Program - http://www.annexia.org/tmp/movies.ml Create the test file - http://www.annexia.org/tmp/make_movies.ml It's best to read the program like this: (1) Start with the _interface_ ('signature') of the new ExtArray1 module & type. _Ignore_ the implementation of this module for now. (2) Then look at the main part of the program (from where we allocate the result array down through the loop which reads the data). (3) Then look at the implementation of the module. The main complexity is that you can't just extend a Bigarray, but you have to keep reallocating it (in large chunks for efficiency). I measured it as taking some 230 MB for a 10 million line data file, but that doesn't necessarily mean it'll take 2 GB for 100 million lines because there's some space overhead which will decline as a proportion of the total memory used. Rich. -- Richard Jones Red Hat