From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id NAA22133; Tue, 27 Aug 2002 13:20:07 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id NAA22033 for ; Tue, 27 Aug 2002 13:20:06 +0200 (MET DST) Received: from smtp3.cern.ch (smtp3.cern.ch [137.138.131.164]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id g7RBK5f29241 for ; Tue, 27 Aug 2002 13:20:05 +0200 (MET DST) Received: from pcdh91.cern.ch (pcdh91.cern.ch [137.138.30.104]) by smtp3.cern.ch (8.12.1/8.12.1) with ESMTP id g7RBK5Ju003539 for ; Tue, 27 Aug 2002 13:20:05 +0200 (MET DST) Received: from simko by pcdh91.cern.ch with local (Exim 3.35 #1 (Debian)) id 17jeOE-0007Yz-00 for ; Tue, 27 Aug 2002 13:20:06 +0200 To: caml-list@inria.fr Subject: [Caml-list] intersecting huge integer sets From: Tibor Simko Date: Tue, 27 Aug 2002 13:20:06 +0200 Message-ID: <87n0r835w9.fsf@pcdh91.cern.ch> User-Agent: Gnus/5.090007 (Oort Gnus v0.07) Emacs/21.2 (i386-debian-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Hello I wonder what is the OCaml canonical way to get fast intersection of huge sets of integers: say, to intersect two sets of 500000 items containing distinct integer values from the (0,1000000) interval. I quickly tested the Hashtbl way, along the lines: +---- | let dsize d = | (* return length of dictionary *) | let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;; | | let dcreate nitems maxval = | (* create dictionary of nitems with integer keys from (0,maxval) interval *) | let d = Hashtbl.create 0 and | size = ref 0 and | rnd = ref 0 in | while !size < nitems && !size < maxval do | rnd := Random.int maxval; | if Hashtbl.mem d !rnd | then () | else (incr size; | Hashtbl.add d !rnd 1) | done; | d;; | | let dintersect d1 d2 = | (* intersect two dictionaries: retain elements in d1 that are also in d2 *) | (* note: assuming that d1 is smaller in size than d2. Otherwise not efficient. *) | Hashtbl.iter (fun key _ -> | if Hashtbl.mem d2 key then () | else Hashtbl.remove d1 key) d1;; | | (* main *) | let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and | maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;; | Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval; | flush stdout; | let d1 = dcreate nitems maxval and | d2 = dcreate nitems maxval in | let t1 = Sys.time () in | dintersect d1 d2; | Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);; +---- I found that the dintersect function appears to be about 3-4 times faster than its Python equivalent, and about 10% faster than Java's native HashSet retainAll() method. I wonder how to improve OCaml performance in this special case. Is there a better suited OCaml data structure to simulate huge hashed sets of integers rather than by using OCaml's native Hashtbl? Tibor ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners