caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Richard Jones <rich@annexia.org>
To: John Caml <camljohn42@gmail.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] large hash tables
Date: Wed, 20 Feb 2008 12:48:41 +0000	[thread overview]
Message-ID: <20080220124841.GA25817@annexia.org> (raw)
In-Reply-To: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com>

On Tue, Feb 19, 2008 at 03:01:46PM -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory.

You can do as well as C++, but you need to understand how values are
stored in the OCaml runtime.

In C++ you're using an int and a (single-precision) float, and
presumably you're storing them packed into an array.  Something like
this?

  struct item { int i; float f; };
  struct item array[100000000];

The storage requirements for this are 100 * 10^6 * 8 bytes = 800 MB,
as you say.

In OCaml things are a bit different.  Firstly you are using
double-precision floats ("float" in OCaml is the same as double in C),
so those are 8 bytes.  Secondly you're using OCaml ints, and since you
must be on a 64 bit machine, those are 64 bits wide (although in OCaml
you can only use 63 of those bits).  So the minimum for storing an int
and a float in OCaml is 16 bytes.

It gets worse though because you're using a (int * float) tuple.
These are stored particularly inefficiently in OCaml.  There is an 8
byte header, 8 bytes for the int, then the float is a boxed pointer
[stored in a separate allocation] which means an 8 byte pointer,
another 8 byte header plus 8 bytes for the double-precision float.  So
if I've got the math right, the cost of storing each tuple would be:

  8 +              8 +   8 +               8 +              8
  header of tuple  int   pointer to float  header of float  float
  = 40 bytes.

Storing those in a simple array means another pointer per tuple
(tuples aren't packed into the array, but pointed to), so that's a
total of around:

  100 * 10^6 * (8 + 40) = 4.8 GB.

Luckily there are much better ways to store this data in OCaml.  If
you want to keep it simple, use two arrays, a 'float array' to store
double-precision floats and an 'int array' to store 63 bit ints.
These are both unboxed in OCaml, so this works out as:

  float array:   100 * 10^6 * 8
  int array:   + 100 * 10^6 * 8 = 1.6 GB

You can do better (the same as C++) by using Bigarray.  The basic
saving of Bigarray is that you can store single-precision float and 32
bit integers directly in them.  So either go for two Bigarrays, or use
one Bigarray (of int32_elt type and length 200 million elements) and a
module wrapping the details of inserting and getting the pairs (also
the functions Int32.bits_of_float and Int32.float_of_bits will help
you here).

Rich.

-- 
Richard Jones
Red Hat


  parent reply	other threads:[~2008-02-20 12:48 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-19 23:01 John Caml
2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
2008-02-19 23:36 ` Gerd Stolpmann
2008-02-19 23:51 ` Francois Rouaix
2008-02-20  9:37   ` Berke Durak
2008-02-20  9:56     ` Berke Durak
2008-02-20 12:48 ` Richard Jones [this message]
2008-02-20 15:54 ` Oliver Bandel
2008-02-21 22:45   ` John Caml
2008-02-22  0:33     ` Richard Jones
2008-02-24  5:39       ` John Caml
2008-02-22 14:19     ` Brian Hurt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080220124841.GA25817@annexia.org \
    --to=rich@annexia.org \
    --cc=caml-list@yquem.inria.fr \
    --cc=camljohn42@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).