caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Tibor Simko <tibor.simko@cern.ch>
To: caml-list@inria.fr
Subject: [Caml-list] intersecting huge integer sets
Date: Tue, 27 Aug 2002 13:20:06 +0200	[thread overview]
Message-ID: <87n0r835w9.fsf@pcdh91.cern.ch> (raw)

Hello

I wonder what is the OCaml canonical way to get fast intersection of
huge sets of integers: say, to intersect two sets of 500000 items
containing distinct integer values from the (0,1000000) interval.  

I quickly tested the Hashtbl way, along the lines:

+---- 
|    let dsize d = 
|      (* return length of dictionary *)
|      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
| 
|    let dcreate nitems maxval =     
|      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
|      let d = Hashtbl.create 0 and
|        size = ref 0 and
|        rnd = ref 0 in
|        while !size < nitems && !size < maxval do
|          rnd := Random.int maxval;
|          if Hashtbl.mem d !rnd 
|          then ()
|          else (incr size;
|                Hashtbl.add d !rnd 1)
|        done;
|      d;;
| 
|    let dintersect d1 d2 =
|      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
|      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
|      Hashtbl.iter (fun key _ ->
|                      if Hashtbl.mem d2 key then ()
|                      else Hashtbl.remove d1 key) d1;;
| 
|    (* main *)
|    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
|        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
|    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
|    flush stdout;
|    let d1 = dcreate nitems maxval and
|        d2 = dcreate nitems maxval in
|      let t1 = Sys.time () in
|        dintersect d1 d2;
|        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
+----

I found that the dintersect function appears to be about 3-4 times
faster than its Python equivalent, and about 10% faster than Java's
native HashSet retainAll() method.

I wonder how to improve OCaml performance in this special case.  Is
there a better suited OCaml data structure to simulate huge hashed
sets of integers rather than by using OCaml's native Hashtbl?

Tibor
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


             reply	other threads:[~2002-08-27 11:20 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-08-27 11:20 Tibor Simko [this message]
2002-08-27 11:52 ` Markus Mottl
2002-08-27 12:09   ` Diego Olivier Fernandez Pons
2002-08-27 12:06 ` Yaron M. Minsky
2002-08-27 12:12   ` Markus Mottl
2002-08-28 12:29     ` Tibor Simko
2002-08-29 10:13 ` Diego Olivier Fernandez Pons
2002-08-29 11:33   ` [Caml-list] barre dans le filtrage de motifs Diego Olivier Fernandez Pons

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87n0r835w9.fsf@pcdh91.cern.ch \
    --to=tibor.simko@cern.ch \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).