From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id OAA23356; Tue, 27 Aug 2002 14:07:00 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id OAA23197 for ; Tue, 27 Aug 2002 14:06:59 +0200 (MET DST) Received: from gull.mail.pas.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g7RC6wn29948 for ; Tue, 27 Aug 2002 14:06:58 +0200 (MET DST) Received: from user-0ccegbj.cable.mindspring.com ([24.199.65.115] helo=flapdragon.homeip.net) by gull.mail.pas.earthlink.net with esmtp (Exim 3.33 #1) id 17jf7Y-0005mq-00; Tue, 27 Aug 2002 05:06:56 -0700 Subject: Re: [Caml-list] intersecting huge integer sets From: "Yaron M. Minsky" To: Tibor Simko Cc: caml-list@inria.fr In-Reply-To: <87n0r835w9.fsf@pcdh91.cern.ch> References: <87n0r835w9.fsf@pcdh91.cern.ch> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 Date: 27 Aug 2002 08:06:54 -0400 Message-Id: <1030450016.24909.45.camel@dragonfly.localdomain> Mime-Version: 1.0 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk If the sets are really as dense as you say they are (on the order of 50% of all possible elements), then the fastest thing is probably just to use a bool array. Such an approach would require 3 passes: one to set all the elements of the array to false. One to mark the elements corresponding to set 1 to true, and then one pass over set 2 to see which of its elements are shared by 1. y On Tue, 2002-08-27 at 07:20, Tibor Simko wrote: > Hello > > I wonder what is the OCaml canonical way to get fast intersection of > huge sets of integers: say, to intersect two sets of 500000 items > containing distinct integer values from the (0,1000000) interval. > > I quickly tested the Hashtbl way, along the lines: > > +---- > | let dsize d = > | (* return length of dictionary *) > | let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;; > | > | let dcreate nitems maxval = > | (* create dictionary of nitems with integer keys from (0,maxval) interval *) > | let d = Hashtbl.create 0 and > | size = ref 0 and > | rnd = ref 0 in > | while !size < nitems && !size < maxval do > | rnd := Random.int maxval; > | if Hashtbl.mem d !rnd > | then () > | else (incr size; > | Hashtbl.add d !rnd 1) > | done; > | d;; > | > | let dintersect d1 d2 = > | (* intersect two dictionaries: retain elements in d1 that are also in d2 *) > | (* note: assuming that d1 is smaller in size than d2. Otherwise not efficient. *) > | Hashtbl.iter (fun key _ -> > | if Hashtbl.mem d2 key then () > | else Hashtbl.remove d1 key) d1;; > | > | (* main *) > | let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and > | maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;; > | Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval; > | flush stdout; > | let d1 = dcreate nitems maxval and > | d2 = dcreate nitems maxval in > | let t1 = Sys.time () in > | dintersect d1 d2; > | Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);; > +---- > > I found that the dintersect function appears to be about 3-4 times > faster than its Python equivalent, and about 10% faster than Java's > native HashSet retainAll() method. > > I wonder how to improve OCaml performance in this special case. Is > there a better suited OCaml data structure to simulate huge hashed > sets of integers rather than by using OCaml's native Hashtbl? > > Tibor > ------------------- > To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr > Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners -- |--------/ Yaron M. Minsky \--------| |--------\ http://www.cs.cornell.edu/home/yminsky/ /--------| Open PGP --- KeyID B1FFD916 (new key as of Dec 4th) Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners