caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] intersecting huge integer sets
@ 2002-08-27 11:20 Tibor Simko
  2002-08-27 11:52 ` Markus Mottl
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Tibor Simko @ 2002-08-27 11:20 UTC (permalink / raw)
  To: caml-list

Hello

I wonder what is the OCaml canonical way to get fast intersection of
huge sets of integers: say, to intersect two sets of 500000 items
containing distinct integer values from the (0,1000000) interval.  

I quickly tested the Hashtbl way, along the lines:

+---- 
|    let dsize d = 
|      (* return length of dictionary *)
|      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
| 
|    let dcreate nitems maxval =     
|      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
|      let d = Hashtbl.create 0 and
|        size = ref 0 and
|        rnd = ref 0 in
|        while !size < nitems && !size < maxval do
|          rnd := Random.int maxval;
|          if Hashtbl.mem d !rnd 
|          then ()
|          else (incr size;
|                Hashtbl.add d !rnd 1)
|        done;
|      d;;
| 
|    let dintersect d1 d2 =
|      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
|      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
|      Hashtbl.iter (fun key _ ->
|                      if Hashtbl.mem d2 key then ()
|                      else Hashtbl.remove d1 key) d1;;
| 
|    (* main *)
|    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
|        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
|    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
|    flush stdout;
|    let d1 = dcreate nitems maxval and
|        d2 = dcreate nitems maxval in
|      let t1 = Sys.time () in
|        dintersect d1 d2;
|        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
+----

I found that the dintersect function appears to be about 3-4 times
faster than its Python equivalent, and about 10% faster than Java's
native HashSet retainAll() method.

I wonder how to improve OCaml performance in this special case.  Is
there a better suited OCaml data structure to simulate huge hashed
sets of integers rather than by using OCaml's native Hashtbl?

Tibor
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2002-08-29 11:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-27 11:20 [Caml-list] intersecting huge integer sets Tibor Simko
2002-08-27 11:52 ` Markus Mottl
2002-08-27 12:09   ` Diego Olivier Fernandez Pons
2002-08-27 12:06 ` Yaron M. Minsky
2002-08-27 12:12   ` Markus Mottl
2002-08-28 12:29     ` Tibor Simko
2002-08-29 10:13 ` Diego Olivier Fernandez Pons
2002-08-29 11:33   ` [Caml-list] barre dans le filtrage de motifs Diego Olivier Fernandez Pons

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).