caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Re: large hash tables
       [not found] ` <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
@ 2008-02-20  5:18   ` John Caml
  2008-02-20  6:11     ` [Caml-list] " Francois Rouaix
                       ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: John Caml @ 2008-02-20  5:18 UTC (permalink / raw)
  To: caml-list

Thank you all for the assistance.

I've resolved the Stack_overflow problem by using an Array instead of
a Hashtbl; my keys were just consecutive integers, so this later
approach is clearly preferable.

However, the memory usage is still pretty bad...it takes nearly an
order of magnitude more memory than the equivalent C++ program. While
the C++ program required 800 MB, my ocaml program requires roughly 6
GB. Am I doing something very inefficiently? My revised code appears
below.

Also, if you have any other coding suggestions I'd appreciate hearing
them. I'm a long-time coder but new to Ocaml and eager to learn.


--------------

exception SplitError


let loadWholeFile filename =
    let infile = open_in filename
    and movieMajor = Array.make 17770 [] in

    let rec loadLines count =
        let line = input_line infile in
        let murList = Pcre.split line in

        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r
                and mInt = int_of_string m
                and uInt = int_of_string u in

                let newElement = (uInt, rFloat)
                and oldList = movieMajor.(mInt) in
                let newList = List.rev_append [newElement] oldList in
                Array.set movieMajor mInt newList;

                if (count mod 1000000) == 0 then begin
                    Printf.printf "count: %d\n" count;
                    flush stdout;
                    end;

                    loadLines (count + 1)

            | _ -> raise SplitError
  in

    try
        loadLines 0
    with
        End_of_file -> close_in infile;
        movieMajor
;;


let filename = Sys.argv.(1);;
let str = loadWholeFile filename;;


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20  5:18   ` large hash tables John Caml
@ 2008-02-20  6:11     ` Francois Rouaix
  2008-02-20  8:37     ` David Allsopp
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Francois Rouaix @ 2008-02-20  6:11 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 2870 bytes --]

Well, you could use resizeable arrays instead of lists for each bucket.If
you have 100M items, each "cons" becomes fairly expensive. A pointer per
item is 400MB...
I'm a bit surprised that the C++ program only required 800MB - that would be
8 bytes exactly per item; if each item is an int (4 bytes) and a double (8
bytes), it doesn't add up. Or are you using single precision floats and
arrays everywhere (no objects, structs of any kind) ?

The most memory efficient representation in OCaml would probably be a couple
of arrays, ints and floats. For an item indexed by j, the int value is
ints.(j) and the float value is in floats.(j).
ints.(j) == [| i0; ... |]
floats.(j) == [ f0; ... |]

--f


On Feb 19, 2008 9:18 PM, John Caml <camljohn42@gmail.com> wrote:

> Thank you all for the assistance.
>
> I've resolved the Stack_overflow problem by using an Array instead of
> a Hashtbl; my keys were just consecutive integers, so this later
> approach is clearly preferable.
>
> However, the memory usage is still pretty bad...it takes nearly an
> order of magnitude more memory than the equivalent C++ program. While
> the C++ program required 800 MB, my ocaml program requires roughly 6
> GB. Am I doing something very inefficiently? My revised code appears
> below.
>
> Also, if you have any other coding suggestions I'd appreciate hearing
> them. I'm a long-time coder but new to Ocaml and eager to learn.
>
>
> --------------
>
> exception SplitError
>
>
> let loadWholeFile filename =
>    let infile = open_in filename
>    and movieMajor = Array.make 17770 [] in
>
>    let rec loadLines count =
>        let line = input_line infile in
>        let murList = Pcre.split line in
>
>        match murList with
>            | m::u::r::[] ->
>                let rFloat = float_of_string r
>                and mInt = int_of_string m
>                and uInt = int_of_string u in
>
>                let newElement = (uInt, rFloat)
>                and oldList = movieMajor.(mInt) in
>                let newList = List.rev_append [newElement] oldList in
>                Array.set movieMajor mInt newList;
>
>                if (count mod 1000000) == 0 then begin
>                    Printf.printf "count: %d\n" count;
>                    flush stdout;
>                    end;
>
>                    loadLines (count + 1)
>
>            | _ -> raise SplitError
>  in
>
>    try
>        loadLines 0
>    with
>        End_of_file -> close_in infile;
>        movieMajor
> ;;
>
>
> let filename = Sys.argv.(1);;
> let str = loadWholeFile filename;;
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 4644 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [Caml-list] Re: large hash tables
  2008-02-20  5:18   ` large hash tables John Caml
  2008-02-20  6:11     ` [Caml-list] " Francois Rouaix
@ 2008-02-20  8:37     ` David Allsopp
  2008-02-20  8:44       ` Alain Frisch
  2008-02-20 13:37     ` Damien Doligez
  2008-02-20 16:02     ` Christopher L Conway
  3 siblings, 1 reply; 10+ messages in thread
From: David Allsopp @ 2008-02-20  8:37 UTC (permalink / raw)
  To: caml-list

One further comment on the code below, as you've said that you're a
long-time C++ coder is:

> if (count mod 1000000) == 0 then begin

I think you probably mean to use a single = here: although the semantics for
(=) and (==) are the same for ints, I reckon you may have used == as the
C/C++ equality operator. The section "Comparisons" in module Pervasives
explains the difference between the = and == operators in OCaml.

Additionally, there is syntactic sugar for Array.set allowing you write:

> Array.set movieMajor mInt newList;

as:

movieMajor.(mInt) <- newList;

IMHO, the Stack_overflow you saw with the Hashtbl implementation  in the
StdLib (assuming it really is there) is a bug and should be reported - if
OCaml's got enough memory left to fill the table, then it shouldn't crash
resizing, even it means that the resizing is potentially slower (a slower
version could always be used if the buckets are detected to be huge or in
response to a caught Stack_overflow - though that could prove tricky).


David

-----Original Message-----
From: caml-list-bounces@yquem.inria.fr
[mailto:caml-list-bounces@yquem.inria.fr] On Behalf Of John Caml
Sent: 20 February 2008 06:19
To: caml-list@yquem.inria.fr
Subject: [Caml-list] Re: large hash tables

Thank you all for the assistance.

I've resolved the Stack_overflow problem by using an Array instead of
a Hashtbl; my keys were just consecutive integers, so this later
approach is clearly preferable.

However, the memory usage is still pretty bad...it takes nearly an
order of magnitude more memory than the equivalent C++ program. While
the C++ program required 800 MB, my ocaml program requires roughly 6
GB. Am I doing something very inefficiently? My revised code appears
below.

Also, if you have any other coding suggestions I'd appreciate hearing
them. I'm a long-time coder but new to Ocaml and eager to learn.


--------------

exception SplitError


let loadWholeFile filename =
    let infile = open_in filename
    and movieMajor = Array.make 17770 [] in

    let rec loadLines count =
        let line = input_line infile in
        let murList = Pcre.split line in

        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r
                and mInt = int_of_string m
                and uInt = int_of_string u in

                let newElement = (uInt, rFloat)
                and oldList = movieMajor.(mInt) in
                let newList = List.rev_append [newElement] oldList in
                Array.set movieMajor mInt newList;

                if (count mod 1000000) == 0 then begin
                    Printf.printf "count: %d\n" count;
                    flush stdout;
                    end;

                    loadLines (count + 1)

            | _ -> raise SplitError
  in

    try
        loadLines 0
    with
        End_of_file -> close_in infile;
        movieMajor
;;


let filename = Sys.argv.(1);;
let str = loadWholeFile filename;;

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20  8:37     ` David Allsopp
@ 2008-02-20  8:44       ` Alain Frisch
  0 siblings, 0 replies; 10+ messages in thread
From: Alain Frisch @ 2008-02-20  8:44 UTC (permalink / raw)
  To: David Allsopp; +Cc: caml-list

David Allsopp wrote:
> resizing, even it means that the resizing is potentially slower (a slower
> version could always be used if the buckets are detected to be huge or in
> response to a caught Stack_overflow - though that could prove tricky).

The detection of Stack_overflow does not work well for all the supported 
systems. A library should not assume that the exception will be raised 
properly.

-- Alain


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20  5:18   ` large hash tables John Caml
  2008-02-20  6:11     ` [Caml-list] " Francois Rouaix
  2008-02-20  8:37     ` David Allsopp
@ 2008-02-20 13:37     ` Damien Doligez
  2008-02-20 14:37       ` Oliver Bandel
  2008-02-20 16:02     ` Christopher L Conway
  3 siblings, 1 reply; 10+ messages in thread
From: Damien Doligez @ 2008-02-20 13:37 UTC (permalink / raw)
  To: Caml Mailing List; +Cc: John Caml


On 2008-02-20, at 06:18, John Caml wrote:

> However, the memory usage is still pretty bad...it takes nearly an
> order of magnitude more memory than the equivalent C++ program. While
> the C++ program required 800 MB, my ocaml program requires roughly 6
> GB. Am I doing something very inefficiently? My revised code appears
> below.

In order to optimize for space, you need to understand how OCaml
represents things in memory.

>        match murList with
>            | m::u::r::[] ->
>                let rFloat = float_of_string r
>                and mInt = int_of_string m
>                and uInt = int_of_string u in
>
>                let newElement = (uInt, rFloat)
>                and oldList = movieMajor.(mInt) in
>                let newList = List.rev_append [newElement] oldList in
>                Array.set movieMajor mInt newList;


Here, you have an array of lists of pairs.

The space used by the array is 1 + length + size of contents.

The space used by a list is 3 * length + size of the contents.

The space used by a pair is 3 + size of contents.

The space used by an int is 0 (it's already counted in the container).

The space used by a float is a bit tricky: it's normally 3, but
float arrays and records of floats are special: it's 1 in this case.

All these numbers are in 4-byte words, assuming a 32-bit architecture.
On a 64-bit machine, the unit is 8-byte words and floats are size 2
(normally) or 0 (in arrays and records of floats).

Here your memory usage will be : 1 + 17770 + 300M + 300M + 200M = 800M,
or about 6.4 GB (assuming you have a 64-bit machine).

You can do much better by using a tuple of arrays instead of an array
of lists of tuples.  If you use three arrays (for m, u, and r), read
all the data in the arrays, and then do a radix sort (and fill a small
indexing array), you'll be down to 2.4 GB of memory.  If that's still
too much, you could use bigarrays with 32-bit ints and floats, and get
down to 1.2 GB.

I don't see how you can get 800M in your C program unless your ints
are 16-bit, in which case you can do it with bigarrays too.

The only problem with all this, is that you'll have to write the
sorting code yourself.

In conclusion OCaml is not really well-suited to tight memory
situations, but usually you can manage.

-- Damien


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20 13:37     ` Damien Doligez
@ 2008-02-20 14:37       ` Oliver Bandel
  0 siblings, 0 replies; 10+ messages in thread
From: Oliver Bandel @ 2008-02-20 14:37 UTC (permalink / raw)
  To: Caml Mailing List

Zitat von Damien Doligez <damien.doligez@inria.fr>:
[...]
> Here, you have an array of lists of pairs.
[...]

So it's the datastructure, that eats up space.


[...]
> In conclusion OCaml is not really well-suited to tight memory
> situations, but usually you can manage.

So, is it the language, not the used data structures, that eat up space?


Are the used data structures in the C++ library as complex as the above
used? Do we compare things that can be compared? What if the
datastructures in the C++ solution would be as complex as in the OCaml
case, would it then also eat up seom GBs, or are they nevertheless
lighter?

Are more specialized/generic data structures needed to be added as
library for OCaml? What would the Extlib-fans offer as possibilities
here? Are there solutions to this?


Ciao,
   Oliver

P.S.: I do not follow all discussions on that list; I really want,
      but's a lot ongoing here since the developers meeting. There are
      some threads I didn't touched so far. Will there be conclusions/
      digests on those discussions? Possibly one of my questions above
      already is addressed there?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20  5:18   ` large hash tables John Caml
                       ` (2 preceding siblings ...)
  2008-02-20 13:37     ` Damien Doligez
@ 2008-02-20 16:02     ` Christopher L Conway
  2008-02-21 13:54       ` Damien Doligez
  3 siblings, 1 reply; 10+ messages in thread
From: Christopher L Conway @ 2008-02-20 16:02 UTC (permalink / raw)
  To: caml-list

John,

Everybody else has answered your main concerns in great detail. I just
have a silly nitpick...

>                 let newList = List.rev_append [newElement] oldList in

This is what is known as a "functional anti-pattern" [1]. As you
probably know, adding an element at the head of a list is O(1),
whereas adding an element at the tail is O(n). If possible, you should
just keep these lists in reverse order (let newList = newElement ::
oldList) and call List.rev "on demand" when you pull the data out of
the table (or keep separate forward/reversed lists, populating the
forward list on demand, ala Okasaki's deques [2]).

Regards,
Chris

[1] http://trevion.blogspot.com/2006/12/little-knowledge-gets-you-long-way.html
[2] http://www.eecs.usma.edu/webs/people/okasaki/pubs.html#jfp95


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-20 16:02     ` Christopher L Conway
@ 2008-02-21 13:54       ` Damien Doligez
  2008-02-21 16:40         ` Christopher L Conway
  0 siblings, 1 reply; 10+ messages in thread
From: Damien Doligez @ 2008-02-21 13:54 UTC (permalink / raw)
  To: Caml Mailing List


On 2008-02-20, at 17:02, Christopher L Conway wrote:

>>                let newList = List.rev_append [newElement] oldList in
>
> This is what is known as a "functional anti-pattern" [1].

Not really.  This is just a rather long-winded way to write

                let newList = newElement :: oldList in

but it doesn't take much more time or memory to execute.

-- Damien


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Re: large hash tables
  2008-02-21 13:54       ` Damien Doligez
@ 2008-02-21 16:40         ` Christopher L Conway
  0 siblings, 0 replies; 10+ messages in thread
From: Christopher L Conway @ 2008-02-21 16:40 UTC (permalink / raw)
  To: Damien Doligez; +Cc: Caml Mailing List

On Thu, Feb 21, 2008 at 8:54 AM, Damien Doligez <damien.doligez@inria.fr> wrote:
>
>  On 2008-02-20, at 17:02, Christopher L Conway wrote:
>
>  >>                let newList = List.rev_append [newElement] oldList in
>  >
>  > This is what is known as a "functional anti-pattern" [1].
>
>  Not really.  This is just a rather long-winded way to write
>
>                 let newList = newElement :: oldList in
>
>  but it doesn't take much more time or memory to execute.

That was silly of me. A point (not my original point) still stands:
the code is non-idiomatic.

List.rev_append [newElement] oldList
  = List.append [newElement] oldList
  = newElement :: oldList

Regards,
Chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* large hash tables
@ 2008-02-19 23:01 John Caml
  0 siblings, 0 replies; 10+ messages in thread
From: John Caml @ 2008-02-19 23:01 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1771 bytes --]

I need to load roughly 100 million items into a memory based hash table,
where each key is an integer and each value is an integer and a float. Under
C++ stdlib's map library, this requires 800 MB of memory. Under my naive
porting of my C++ program to Ocaml, only 12 million rows get loaded before I
get the program crashes with the message "Fatal error: exception
Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use.
(My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient
memory is not the issue.)

Two questions:

1. How can I overcome the Stack_overflow error? (I'm already compiling with
ocamlopt rather than ocamlc, which seems to have increased the stack size
from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my
machine.
2. How can I implment a more efficient hash table? My Ocaml program seems to
require 10x more memory than under C++.

My code appears below. Thank you very much.

--John


--------------
exception SplitError

let read_whole_chan chan =
    let movieMajor = Hashtbl.create 777777 in

    let rec loadLines count =
        let line = input_line chan in
        let murList = Pcre.split line in
        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r in
                Hashtbl.add movieMajor m (u, rFloat);
                if (count mod 10000) == 0 then Printf.printf "count: %d, m:
%s, u: %s, r: %f \n" count m u rFloat;
                loadLines (count + 1)
            | _ -> raise SplitError
  in

    try
        loadLines 0
    with
        End_of_file -> ()
    ;;

let read_whole_file filename =
    let chan = open_in filename in
    read_whole_chan chan
    ;;

let filename = Sys.argv.(1);;

let str = read_whole_file filename;;

[-- Attachment #2: Type: text/html, Size: 2598 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-02-21 16:40 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <fa.XXbywsQknpl7bhlesWN8vFLM58c@ifi.uio.no>
     [not found] ` <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
2008-02-20  5:18   ` large hash tables John Caml
2008-02-20  6:11     ` [Caml-list] " Francois Rouaix
2008-02-20  8:37     ` David Allsopp
2008-02-20  8:44       ` Alain Frisch
2008-02-20 13:37     ` Damien Doligez
2008-02-20 14:37       ` Oliver Bandel
2008-02-20 16:02     ` Christopher L Conway
2008-02-21 13:54       ` Damien Doligez
2008-02-21 16:40         ` Christopher L Conway
2008-02-19 23:01 John Caml

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).