caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Damien Doligez <damien.doligez@inria.fr>
To: Caml Mailing List <caml-list@yquem.inria.fr>
Cc: John Caml <camljohn42@gmail.com>
Subject: Re: [Caml-list] Re: large hash tables
Date: Wed, 20 Feb 2008 14:37:07 +0100	[thread overview]
Message-ID: <300C2AED-0D26-438B-A45E-5635043FEE61@inria.fr> (raw)
In-Reply-To: <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com>


On 2008-02-20, at 06:18, John Caml wrote:

> However, the memory usage is still pretty bad...it takes nearly an
> order of magnitude more memory than the equivalent C++ program. While
> the C++ program required 800 MB, my ocaml program requires roughly 6
> GB. Am I doing something very inefficiently? My revised code appears
> below.

In order to optimize for space, you need to understand how OCaml
represents things in memory.

>        match murList with
>            | m::u::r::[] ->
>                let rFloat = float_of_string r
>                and mInt = int_of_string m
>                and uInt = int_of_string u in
>
>                let newElement = (uInt, rFloat)
>                and oldList = movieMajor.(mInt) in
>                let newList = List.rev_append [newElement] oldList in
>                Array.set movieMajor mInt newList;


Here, you have an array of lists of pairs.

The space used by the array is 1 + length + size of contents.

The space used by a list is 3 * length + size of the contents.

The space used by a pair is 3 + size of contents.

The space used by an int is 0 (it's already counted in the container).

The space used by a float is a bit tricky: it's normally 3, but
float arrays and records of floats are special: it's 1 in this case.

All these numbers are in 4-byte words, assuming a 32-bit architecture.
On a 64-bit machine, the unit is 8-byte words and floats are size 2
(normally) or 0 (in arrays and records of floats).

Here your memory usage will be : 1 + 17770 + 300M + 300M + 200M = 800M,
or about 6.4 GB (assuming you have a 64-bit machine).

You can do much better by using a tuple of arrays instead of an array
of lists of tuples.  If you use three arrays (for m, u, and r), read
all the data in the arrays, and then do a radix sort (and fill a small
indexing array), you'll be down to 2.4 GB of memory.  If that's still
too much, you could use bigarrays with 32-bit ints and floats, and get
down to 1.2 GB.

I don't see how you can get 800M in your C program unless your ints
are 16-bit, in which case you can do it with bigarrays too.

The only problem with all this, is that you'll have to write the
sorting code yourself.

In conclusion OCaml is not really well-suited to tight memory
situations, but usually you can manage.

-- Damien


  parent reply	other threads:[~2008-02-20 13:37 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <fa.XXbywsQknpl7bhlesWN8vFLM58c@ifi.uio.no>
     [not found] ` <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
2008-02-20  5:18   ` John Caml
2008-02-20  6:11     ` [Caml-list] " Francois Rouaix
2008-02-20  8:37     ` David Allsopp
2008-02-20  8:44       ` Alain Frisch
2008-02-20 13:37     ` Damien Doligez [this message]
2008-02-20 14:37       ` Oliver Bandel
2008-02-20 16:02     ` Christopher L Conway
2008-02-21 13:54       ` Damien Doligez
2008-02-21 16:40         ` Christopher L Conway

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=300C2AED-0D26-438B-A45E-5635043FEE61@inria.fr \
    --to=damien.doligez@inria.fr \
    --cc=caml-list@yquem.inria.fr \
    --cc=camljohn42@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).