caml-list - the Caml user's mailing list
* large hash tables
@ 2008-02-19 23:01 John Caml
  2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: John Caml @ 2008-02-19 23:01 UTC (permalink / raw)
  To: caml-list


I need to load roughly 100 million items into a memory based hash table,
where each key is an integer and each value is an integer and a float. Under
C++ stdlib's map library, this requires 800 MB of memory. Under my naive
porting of my C++ program to OCaml, only 12 million rows get loaded before
the program crashes with the message "Fatal error: exception
Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use.
(My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient
memory is not the issue.)

Two questions:

1. How can I overcome the Stack_overflow error? (I'm already compiling with
ocamlopt rather than ocamlc, which seems to have increased the stack size
from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my
machine.
2. How can I implement a more efficient hash table? My OCaml program seems to
require 10x more memory than under C++.

My code appears below. Thank you very much.

--John


--------------
exception SplitError

let read_whole_chan chan =
    let movieMajor = Hashtbl.create 777777 in

    let rec loadLines count =
        let line = input_line chan in
        let murList = Pcre.split line in
        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r in
                Hashtbl.add movieMajor m (u, rFloat);
                if (count mod 10000) == 0 then Printf.printf "count: %d, m:
%s, u: %s, r: %f \n" count m u rFloat;
                loadLines (count + 1)
            | _ -> raise SplitError
  in

    try
        loadLines 0
    with
        End_of_file -> ()
    ;;

let read_whole_file filename =
    let chan = open_in filename in
    read_whole_chan chan
    ;;

let filename = Sys.argv.(1);;

let str = read_whole_file filename;;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] large hash tables
  2008-02-19 23:01 large hash tables John Caml
@ 2008-02-19 23:34 ` Gabriel Kerneis
  2008-02-19 23:36 ` Gerd Stolpmann
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Gabriel Kerneis @ 2008-02-19 23:34 UTC (permalink / raw)
  To: caml-list


Hello,

it won't solve any of your problems, but I have a few remarks on your
code:

> exception SplitError
> 
> let read_whole_chan chan =
>     let movieMajor = Hashtbl.create 777777 in

Here, movieMajor is declared inside read_whole_chan and never returned,
so you won't be able to use it (but it might just be a test program,
I guess).

>     let rec loadLines count =
>         let line = input_line chan in
>         let murList = Pcre.split line in
>         match murList with
>             | m::u::r::[] ->
>                 let rFloat = float_of_string r in
>                 Hashtbl.add movieMajor m (u, rFloat);
>                 loadLines (count + 1)

You use tail recursion properly, so your stack overflow doesn't come
from there. No idea why it occurs, though...

>             | _ -> raise SplitError
>   in
> 
>     try
>         loadLines 0
>     with
>         End_of_file -> ()
>     ;;
> 
> let read_whole_file filename =
>     let chan = open_in filename in
>     read_whole_chan chan
>     ;;

You should close the channel when you're done.

> let filename = Sys.argv.(1);;
> 
> let str = read_whole_file filename;;


Regards,
-- 
Gabriel Kerneis



* Re: [Caml-list] large hash tables
  2008-02-19 23:01 large hash tables John Caml
  2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
@ 2008-02-19 23:36 ` Gerd Stolpmann
  2008-02-19 23:51 ` Francois Rouaix
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Gerd Stolpmann @ 2008-02-19 23:36 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list


On Tuesday, 2008-02-19 at 15:01 -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash
> table, where each key is an integer and each value is an integer and a
> float. Under C++ stdlib's map library, this requires 800 MB of memory.
> Under my naive porting of my C++ program to Ocaml, only 12 million
> rows get loaded before I get the program crashes with the message
> "Fatal error: exception Stack_overflow". At the time of the crash,
> nearly 2 GB of memory are in use. (My machine -- AMD64 architecture --
> has 8 GB of memory, so insufficient memory is not the issue.)
> 
> Two questions:
> 
> 1. How can I overcome the Stack_overflow error? (I'm already compiling
> with ocamlopt rather than ocamlc, which seems to have increased the
> stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8
> GB on my machine.
> 2. How can I implment a more efficient hash table? My Ocaml program
> seems to require 10x more memory than under C++.
> 
> My code appears below. Thank you very much.

The stack overflow is very strange. I would expect it only in one data
case: that there is a single key that occurs very often (tens of
thousands of times). In that case, there can be a stack overflow when
the hash table is resized. It is possible to work around this by
handling that case explicitly.
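[Editorial note: Gerd's workaround can be sketched as below — keep a single binding per key and accumulate the duplicate values yourself, so no bucket chain (and no resize step) ever has to walk a long run of identical keys. The names (movieMajor, add_rating) and the data are illustrative, not from John's program.]

(* One binding per key; duplicates accumulate in a mutable list.
   movieMajor and add_rating are illustrative names. *)
let movieMajor : (string, (string * float) list ref) Hashtbl.t =
  Hashtbl.create 777777

let add_rating m u r =
  try
    let l = Hashtbl.find movieMajor m in
    l := (u, r) :: !l                  (* existing key: no new binding *)
  with Not_found ->
    Hashtbl.add movieMajor m (ref [ (u, r) ])

let () =
  add_rating "1" "42" 3.5;
  add_rating "1" "43" 4.0;
  assert (List.length !(Hashtbl.find movieMajor "1") = 2)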

Gerd


-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] large hash tables
  2008-02-19 23:01 large hash tables John Caml
  2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
  2008-02-19 23:36 ` Gerd Stolpmann
@ 2008-02-19 23:51 ` Francois Rouaix
  2008-02-20  9:37   ` Berke Durak
  2008-02-20 12:48 ` Richard Jones
  2008-02-20 15:54 ` Oliver Bandel
  4 siblings, 1 reply; 13+ messages in thread
From: Francois Rouaix @ 2008-02-19 23:51 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list


In the resizing code there is a non-tail-recursive function (insert_bucket).
This is most likely what causes the stack overflow, as I can't see any other
non-tail-recursive function at first glance. It looks like it isn't tail
recursive in order to maintain an invariant on the order of elements. If that
invariant is not useful to you, you might want to write a slightly different
version of the Hashtbl module in which insert_bucket is tail recursive.
Also, during resizing, memory usage will be roughly twice the memory required
for the table, since the old bucket array remains live until the resize is
completed, so all the bucket contents exist in two versions (old and new).
You might want to stick to a large initial size and avoid resizing
altogether.
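[Editorial note: fixing the initial size is a one-line change. Hashtbl.create's argument is only a hint, but a table created with room for the expected number of bindings never triggers the resize (and hence insert_bucket) during loading. The function name create_sized is illustrative; John's data would pass roughly 100_000_000 as the expected count.]

(* Size the table for the expected number of bindings up front. *)
let create_sized (expected : int) : (int, int * float) Hashtbl.t =
  Hashtbl.create expected

let () =
  let t = create_sized 1_000_000 in
  (* well under the initial capacity, so no resize ever runs *)
  for i = 0 to 999 do Hashtbl.add t i (i, float_of_int i) done;
  assert (Hashtbl.length t = 1000)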
--f





* Re: [Caml-list] large hash tables
  2008-02-19 23:51 ` Francois Rouaix
@ 2008-02-20  9:37   ` Berke Durak
  2008-02-20  9:56     ` Berke Durak
  0 siblings, 1 reply; 13+ messages in thread
From: Berke Durak @ 2008-02-20  9:37 UTC (permalink / raw)
  To: Francois Rouaix; +Cc: John Caml, caml-list

Francois Rouaix wrote:
> In the resizing code there is a non-tailrec function (insert_bucket). 
> This is most likely causing the stack overflow, as I can't see any other 
> non tail recursive function at first glance. Looks like it's not tail 
> rec in order to maintain an invariant on the order of elements. If that 
> invariant is not useful to you, you might want to write a slightly 
> different version of the Hashtbl module, where insert_bucket would be 
> tail rec.
> Also, during resizing, memory usage will be twice the memory required 
> for the table (roughly), since the bucket array remains available until 
> the resize is completed, so all the bucket contents exist in two 
> versions (old and new). You might want to stick to a large initial size 
> and do not attempt resizing.

In that case a quick hack could also be to slightly randomize the key, as in

let digest_bytes = 5

let keyify u =
   (String.sub (Digest.string u) 0 digest_bytes, u)

 >     let read_whole_chan chan =
 >         let movieMajor = Hashtbl.create 777777 in
 >
 >         let rec loadLines count =
 >             let line = input_line chan in
 >             let murList = Pcre.split line in
 >             match murList with
 >                 | m::u::r::[] ->
 >                     let rFloat = float_of_string r in
 >                     Hashtbl.add movieMajor (keyify m) (u, rFloat);
 >                     if (count mod 10000) == 0 then Printf.printf "count:
 >     %d, m: %s, u: %s, r: %f \n" count m u rFloat;
 >                     loadLines (count + 1)
 >                 | _ -> raise SplitError
 >       in
 >
 >         try
 >             loadLines 0
 >         with
 >             End_of_file -> ()
 >         ;;

-- 
Berke DURAK



* Re: [Caml-list] large hash tables
  2008-02-20  9:37   ` Berke Durak
@ 2008-02-20  9:56     ` Berke Durak
  0 siblings, 0 replies; 13+ messages in thread
From: Berke Durak @ 2008-02-20  9:56 UTC (permalink / raw)
  To: Berke Durak; +Cc: Francois Rouaix, caml-list

Berke Durak wrote:
> 
> In that case a quick hack could also be to slightly randomize the key,
> as in
> 
> let digest_bytes = 5
> 
> let keyify u =
>   (String.sub (Digest.string u) 0 digest_bytes, u)
> 

That was stupid.

Actually that's a solution to another problem, as the overflow is
probably not due to hash collisions - we are talking of the same keys here.

But I have a solution: do not use the Hashtbl to store multiple entries per
key; use a Queue per key for that.  Queues include some Obj.magic and are
actually pretty cheap.

let q =
   try
     Hashtbl.find movieMajor m
   with
   | Not_found ->
       let q = Queue.create () in
       Hashtbl.add movieMajor m q;
       q
in
Queue.push (u, rFloat) q

-- 
Berke DURAK



* Re: [Caml-list] large hash tables
  2008-02-19 23:01 large hash tables John Caml
                   ` (2 preceding siblings ...)
  2008-02-19 23:51 ` Francois Rouaix
@ 2008-02-20 12:48 ` Richard Jones
  2008-02-20 15:54 ` Oliver Bandel
  4 siblings, 0 replies; 13+ messages in thread
From: Richard Jones @ 2008-02-20 12:48 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list

On Tue, Feb 19, 2008 at 03:01:46PM -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory.

You can do as well as C++, but you need to understand how values are
stored in the OCaml runtime.

In C++ you're using an int and a (single-precision) float, and
presumably you're storing them packed into an array.  Something like
this?

  struct item { int i; float f; };
  struct item array[100000000];

The storage requirements for this are 100 * 10^6 * 8 bytes = 800 MB,
as you say.

In OCaml things are a bit different.  Firstly you are using
double-precision floats ("float" in OCaml is the same as double in C),
so those are 8 bytes.  Secondly you're using OCaml ints, and since you
must be on a 64 bit machine, those are 64 bits wide (although in OCaml
you can only use 63 of those bits).  So the minimum for storing an int
and a float in OCaml is 16 bytes.

It gets worse though because you're using a (int * float) tuple.
These are stored particularly inefficiently in OCaml.  There is an 8
byte header, 8 bytes for the int, then the float is a boxed pointer
[stored in a separate allocation] which means an 8 byte pointer,
another 8 byte header plus 8 bytes for the double-precision float.  So
if I've got the math right, the cost of storing each tuple would be:

  8 +              8 +   8 +               8 +              8
  header of tuple  int   pointer to float  header of float  float
  = 40 bytes.

Storing those in a simple array means another pointer per tuple
(tuples aren't packed into the array, but pointed to), so that's a
total of around:

  100 * 10^6 * (8 + 40) = 4.8 GB.

Luckily there are much better ways to store this data in OCaml.  If
you want to keep it simple, use two arrays, a 'float array' to store
double-precision floats and an 'int array' to store 63 bit ints.
These are both unboxed in OCaml, so this works out as:

  float array:   100 * 10^6 * 8
  int array:   + 100 * 10^6 * 8 = 1.6 GB

You can do better (the same as C++) by using Bigarray.  The basic
saving of Bigarray is that you can store single-precision float and 32
bit integers directly in them.  So either go for two Bigarrays, or use
one Bigarray (of int32_elt type and length 200 million elements) and a
module wrapping the details of inserting and getting the pairs (also
the functions Int32.bits_of_float and Int32.float_of_bits will help
you here).
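[Editorial note: the two-Bigarray layout Rich describes can be sketched as below, assuming records are addressed by index; the field names (users, ratings) and the size are invented. Each (int, float) pair costs 4 + 4 = 8 unboxed bytes, matching the C++ struct.]

open Bigarray

(* Two unboxed Bigarrays: 4 bytes per int32 + 4 bytes per float32. *)
let n = 1_000
let users   = Array1.create int32   c_layout n
let ratings = Array1.create float32 c_layout n

let set i user rating =
  users.{i}   <- Int32.of_int user;
  ratings.{i} <- rating

let () =
  set 0 42 3.5;
  assert (Int32.to_int users.{0} = 42);
  assert (ratings.{0} = 3.5)   (* 3.5 is exact in single precision *)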

Rich.

-- 
Richard Jones
Red Hat



* Re: [Caml-list] large hash tables
  2008-02-19 23:01 large hash tables John Caml
                   ` (3 preceding siblings ...)
  2008-02-20 12:48 ` Richard Jones
@ 2008-02-20 15:54 ` Oliver Bandel
  2008-02-21 22:45   ` John Caml
  4 siblings, 1 reply; 13+ messages in thread
From: Oliver Bandel @ 2008-02-20 15:54 UTC (permalink / raw)
  To: caml-list

Quoting John Caml <camljohn42@gmail.com>:

> I need to load roughly 100 million items into a memory based hash
> table,
> where each key is an integer and each value is an integer and a
> float. Under
> C++ stdlib's map library, this requires 800 MB of memory. Under my
> naive
[...]


Did you try the Map module in OCaml?

Or would it need more memory per element than the Hashtbl module?
At least there would be no resizing operations...
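[Editorial note: Oliver's alternative sketched below, assuming integer keys; the data is made up. An immutable Map needs no resizing, though each binding allocates balanced-tree nodes along the search path, so its per-element overhead is not obviously lower than a well-sized Hashtbl's.]

(* Immutable AVL-based Map instead of Hashtbl: no resize passes. *)
module IntMap = Map.Make (struct
  type t = int
  let compare = compare
end)

let table =
  List.fold_left
    (fun acc (key, user, rating) -> IntMap.add key (user, rating) acc)
    IntMap.empty
    [ (1, 42, 3.5); (2, 43, 4.0) ]

let () =
  assert (IntMap.find 1 table = (42, 3.5));
  assert (IntMap.find 2 table = (43, 4.0))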

For my applications I have made a habit of using Map
instead of Hashtbl, and it has worked well and performed well
even on more or less large files.

BTW: how big are the files in use that need 800 MB of memory
     with the C++ implementation?


Ciao,
   Oliver



* Re: [Caml-list] large hash tables
  2008-02-20 15:54 ` Oliver Bandel
@ 2008-02-21 22:45   ` John Caml
  2008-02-22  0:33     ` Richard Jones
  2008-02-22 14:19     ` Brian Hurt
  0 siblings, 2 replies; 13+ messages in thread
From: John Caml @ 2008-02-21 22:45 UTC (permalink / raw)
  To: caml-list

The equivalent C++ program uses 874 MB of memory in total. Each of the
100 million records is stored in a vector using 1 single-precision float
and 1 int. Indeed, my machine is AMD64, so OCaml ints are presumably 8
bytes.

I've rewritten my Ocaml program again, this time using Bigarray. Its
memory usage is now the same as under C++, so that's good news.
However, my program is quite ugly now, and it's actually more than
twice as long as my C++ program. Any suggestions for simplifying this
program? The way I initialize the "movieMajor" Array seems especially
wonky, but I couldn't figure out a better way.

Thank you all very much.

John



----------------------------

open Bigarray;;

exception SplitError;;
exception ImpossibleError;;

type listPairs = {userList : int32 list; resList : float list};;

type baPairs = {
	userBa : (int32, int32_elt, c_layout) Array1.t;
	resBa : (float, float32_elt, c_layout) Array1.t};;


let loadWholeFile filename =
	let infile = open_in filename in
	let dummyUserBa = Array1.of_array int32 c_layout (Array.of_list []) in
	let dummyResBa = Array1.of_array float32 c_layout (Array.of_list []) in
	let dummyBas = {userBa = dummyUserBa; resBa = dummyResBa} in
	let movieMajor = Array.make 17770 dummyBas in

	let recordWholeMovie mInt curListPairs =
		let userBa = Array1.of_array int32 c_layout (Array.of_list
curListPairs.userList) in
		let resBa = Array1.of_array float32 c_layout (Array.of_list
curListPairs.resList) in
		let movieBas = {userBa = userBa; resBa = resBa} in
		movieMajor.(mInt) <- movieBas
	in

	let rec loadLines prevPairs prevMovie count =
		let line = input_line infile in
		let murList = Pcre.split line in

		match murList with
			| m::u::r::[] ->
				let rFloat = float_of_string r
				and mInt = int_of_string m
				and uInt = int_of_string u in

				if (count mod 1000000) = 0 then begin
					Printf.printf "count: %d\n" count;
					flush stdout;
					end;

				let newPairs =
					if mInt = prevMovie then
						{userList = Int32.of_int(uInt) :: prevPairs.userList; resList =
rFloat :: prevPairs.resList}
					else begin
						recordWholeMovie mInt prevPairs;
						{userList = [Int32.of_int(uInt)]; resList = [rFloat]}
					end
				in

				loadLines newPairs mInt (count + 1)
				
				(* bug: last movie will not get loaded *)

			| _ -> raise SplitError
  in

	try
		let emptyPairs = {userList = []; resList = []} in
		loadLines emptyPairs (-1) 0
	with
		End_of_file -> close_in infile;
	
  movieMajor;;


let filename = Sys.argv.(1);;
loadWholeFile filename;;



* Re: [Caml-list] large hash tables
  2008-02-21 22:45   ` John Caml
@ 2008-02-22  0:33     ` Richard Jones
  2008-02-24  5:39       ` John Caml
  2008-02-22 14:19     ` Brian Hurt
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Jones @ 2008-02-22  0:33 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list

My version's a bit longer than yours, but hopefully more
idiomatic and easier to understand.

Program - http://www.annexia.org/tmp/movies.ml
Create the test file - http://www.annexia.org/tmp/make_movies.ml

It's best to read the program like this:

(1) Start with the _interface_ ('signature') of the new ExtArray1
module & type.  _Ignore_ the implementation of this module for now.

(2) Then look at the main part of the program (from where we allocate
the result array down through the loop which reads the data).

(3) Then look at the implementation of the module.  The main
complexity is that you can't just extend a Bigarray, but you have to
keep reallocating it (in large chunks for efficiency).
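[Editorial note: Rich's movies.ml is not reproduced here, but the "keep reallocating in large chunks" idea can be sketched as below. The type and function names are guesses at the shape of the technique, not his actual ExtArray1 interface.]

open Bigarray

(* A growable float32 Bigarray: a backing store plus a used length.
   When full, allocate a larger store and blit the old contents over. *)
type ext = {
  mutable data : (float, float32_elt, c_layout) Array1.t;
  mutable len  : int;
}

let create () = { data = Array1.create float32 c_layout 1024; len = 0 }

let append e x =
  let cap = Array1.dim e.data in
  if e.len = cap then begin
    (* grow by doubling, so amortized cost per append stays constant *)
    let bigger = Array1.create float32 c_layout (2 * cap) in
    Array1.blit e.data (Array1.sub bigger 0 cap);
    e.data <- bigger
  end;
  e.data.{e.len} <- x;
  e.len <- e.len + 1

let () =
  let e = create () in
  for i = 0 to 2047 do append e (float_of_int i) done;   (* forces one grow *)
  assert (e.len = 2048);
  assert (e.data.{2047} = 2047.)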

I measured it as taking some 230 MB for a 10 million line data file,
but that doesn't necessarily mean it'll take 2 GB for 100 million
lines because there's some space overhead which will decline as a
proportion of the total memory used.

Rich.

-- 
Richard Jones
Red Hat



* Re: [Caml-list] large hash tables
  2008-02-21 22:45   ` John Caml
  2008-02-22  0:33     ` Richard Jones
@ 2008-02-22 14:19     ` Brian Hurt
  1 sibling, 0 replies; 13+ messages in thread
From: Brian Hurt @ 2008-02-22 14:19 UTC (permalink / raw)
  To: John Caml; +Cc: caml-list

John Caml wrote:

>The equivalent C++ program uses 874 MB of memory in total. Each of the
>1 million records is stored in a vector using 1 single-precision float
>and 1 int. Indeed, my machine is AMD64 so Ocaml int's are presumably 8
>bytes.
>  
>
C ints on AMD64 are still 4 bytes; longs are 8 bytes.  You can prove
this by compiling a quick program:
#include <stdio.h>
int main(void) {
    printf("Ints are %lu bytes long.\n", (unsigned long) sizeof(int));
    return 0;
}

>I've rewritten my Ocaml program again, this time using Bigarray. Its
>memory usage is now the same as under C++, so that's good news.
>However, my program is quite ugly now, and it's actually more than
>twice as long as my C++ program. Any suggestions for simplifying this
>program? The way I initialize the "movieMajor" Array seems especially
>wonky, but I couldn't figure out a better way.
>
>  
>
It's generally a good idea to back off and think about what problem 
you're trying to solve.

Where OCaml generally wins on memory utilization is in using immutable data
structures and sharing data instead of copying it.  This is where a lot of
the decisions OCaml made about how to represent things suddenly make a lot
of sense, if you think in terms of data sharing.  And in lots of
complicated "real" code, the memory gains made by sharing are huge
compared to what copying would have cost.

Brian



* Re: [Caml-list] large hash tables
  2008-02-22  0:33     ` Richard Jones
@ 2008-02-24  5:39       ` John Caml
  0 siblings, 0 replies; 13+ messages in thread
From: John Caml @ 2008-02-24  5:39 UTC (permalink / raw)
  To: Richard Jones; +Cc: caml-list

Richard, thank you so much for revising my program. I learned a lot from
reading over your changes, and the program works very nicely now: 1.2
GB for all 100 million items, which is efficient enough for all
practical purposes. Thanks again.

John





* Re: large hash tables
       [not found] ` <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
@ 2008-02-20  5:18   ` John Caml
  0 siblings, 0 replies; 13+ messages in thread
From: John Caml @ 2008-02-20  5:18 UTC (permalink / raw)
  To: caml-list

Thank you all for the assistance.

I've resolved the Stack_overflow problem by using an Array instead of
a Hashtbl; my keys were just consecutive integers, so this latter
approach is clearly preferable.

However, the memory usage is still pretty bad... it takes nearly an
order of magnitude more memory than the equivalent C++ program. While
the C++ program required 800 MB, my OCaml program requires roughly 6
GB. Am I doing something very inefficiently? My revised code appears
below.

Also, if you have any other coding suggestions I'd appreciate hearing
them. I'm a long-time coder but new to OCaml and eager to learn.


--------------

exception SplitError


let loadWholeFile filename =
    let infile = open_in filename
    and movieMajor = Array.make 17770 [] in

    let rec loadLines count =
        let line = input_line infile in
        let murList = Pcre.split line in

        match murList with
            | m::u::r::[] ->
                let rFloat = float_of_string r
                and mInt = int_of_string m
                and uInt = int_of_string u in

                movieMajor.(mInt) <- (uInt, rFloat) :: movieMajor.(mInt);

                if (count mod 1000000) = 0 then begin
                    Printf.printf "count: %d\n" count;
                    flush stdout
                end;

                loadLines (count + 1)

            | _ -> raise SplitError
  in

    try
        loadLines 0
    with End_of_file ->
        close_in infile;
        movieMajor
;;


let filename = Sys.argv.(1);;
let str = loadWholeFile filename;;



end of thread, other threads:[~2008-02-24  5:39 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-19 23:01 large hash tables John Caml
2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
2008-02-19 23:36 ` Gerd Stolpmann
2008-02-19 23:51 ` Francois Rouaix
2008-02-20  9:37   ` Berke Durak
2008-02-20  9:56     ` Berke Durak
2008-02-20 12:48 ` Richard Jones
2008-02-20 15:54 ` Oliver Bandel
2008-02-21 22:45   ` John Caml
2008-02-22  0:33     ` Richard Jones
2008-02-24  5:39       ` John Caml
2008-02-22 14:19     ` Brian Hurt
     [not found] <fa.XXbywsQknpl7bhlesWN8vFLM58c@ifi.uio.no>
     [not found] ` <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
2008-02-20  5:18   ` John Caml

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).