caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] intersecting huge integer sets
@ 2002-08-27 11:20 Tibor Simko
  2002-08-27 11:52 ` Markus Mottl
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Tibor Simko @ 2002-08-27 11:20 UTC (permalink / raw)
  To: caml-list

Hello

I wonder what is the OCaml canonical way to get fast intersection of
huge sets of integers: say, to intersect two sets of 500000 items
containing distinct integer values from the (0,1000000) interval.  

I quickly tested the Hashtbl way, along the lines:

+---- 
|    let dsize d = 
|      (* return length of dictionary *)
|      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
| 
|    let dcreate nitems maxval =     
|      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
|      let d = Hashtbl.create 0 and
|        size = ref 0 and
|        rnd = ref 0 in
|        while !size < nitems && !size < maxval do
|          rnd := Random.int maxval;
|          if Hashtbl.mem d !rnd 
|          then ()
|          else (incr size;
|                Hashtbl.add d !rnd 1)
|        done;
|      d;;
| 
|    let dintersect d1 d2 =
|      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
|      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
|      Hashtbl.iter (fun key _ ->
|                      if Hashtbl.mem d2 key then ()
|                      else Hashtbl.remove d1 key) d1;;
| 
|    (* main *)
|    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
|        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
|    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
|    flush stdout;
|    let d1 = dcreate nitems maxval and
|        d2 = dcreate nitems maxval in
|      let t1 = Sys.time () in
|        dintersect d1 d2;
|        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
+----

I found that the dintersect function appears to be about 3-4 times
faster than its Python equivalent, and about 10% faster than Java's
native HashSet retainAll() method.

I wonder how to improve OCaml performance in this special case.  Is
there a better suited OCaml data structure to simulate huge hashed
sets of integers rather than by using OCaml's native Hashtbl?

Tibor
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 11:20 [Caml-list] intersecting huge integer sets Tibor Simko
@ 2002-08-27 11:52 ` Markus Mottl
  2002-08-27 12:09   ` Diego Olivier Fernandez Pons
  2002-08-27 12:06 ` Yaron M. Minsky
  2002-08-29 10:13 ` Diego Olivier Fernandez Pons
  2 siblings, 1 reply; 8+ messages in thread
From: Markus Mottl @ 2002-08-27 11:52 UTC (permalink / raw)
  To: Tibor Simko; +Cc: caml-list

Tibor Simko schrieb am Dienstag, den 27. August 2002:
> I wonder how to improve OCaml performance in this special case.
> Is there a better suited OCaml data structure to simulate huge hashed
> sets of integers rather than by using OCaml's native Hashtbl?

Yes, you'll probably want Patricia trees. You can get them from
Jean-Christophe Filliatre's page:

  http://www.lri.fr/~filliatr/software.en.html

They support very fast set operations (union, difference, intersection).

Best regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 11:20 [Caml-list] intersecting huge integer sets Tibor Simko
  2002-08-27 11:52 ` Markus Mottl
@ 2002-08-27 12:06 ` Yaron M. Minsky
  2002-08-27 12:12   ` Markus Mottl
  2002-08-29 10:13 ` Diego Olivier Fernandez Pons
  2 siblings, 1 reply; 8+ messages in thread
From: Yaron M. Minsky @ 2002-08-27 12:06 UTC (permalink / raw)
  To: Tibor Simko; +Cc: caml-list

If the sets are really as dense as you say they are (on the order of 50%
of all possible elements), then the fastest thing is probably just to
use a bool array.  Such an approach would require 3 passes:   one to set
all the elements of the array to false.  One to mark the elements
corresponding to set 1 to true, and then one pass over set 2 to see
which of its elements are shared by 1.

y

On Tue, 2002-08-27 at 07:20, Tibor Simko wrote:
> Hello
> 
> I wonder what is the OCaml canonical way to get fast intersection of
> huge sets of integers: say, to intersect two sets of 500000 items
> containing distinct integer values from the (0,1000000) interval.  
> 
> I quickly tested the Hashtbl way, along the lines:
> 
> +---- 
> |    let dsize d = 
> |      (* return length of dictionary *)
> |      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
> | 
> |    let dcreate nitems maxval =     
> |      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
> |      let d = Hashtbl.create 0 and
> |        size = ref 0 and
> |        rnd = ref 0 in
> |        while !size < nitems && !size < maxval do
> |          rnd := Random.int maxval;
> |          if Hashtbl.mem d !rnd 
> |          then ()
> |          else (incr size;
> |                Hashtbl.add d !rnd 1)
> |        done;
> |      d;;
> | 
> |    let dintersect d1 d2 =
> |      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
> |      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
> |      Hashtbl.iter (fun key _ ->
> |                      if Hashtbl.mem d2 key then ()
> |                      else Hashtbl.remove d1 key) d1;;
> | 
> |    (* main *)
> |    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
> |        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
> |    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
> |    flush stdout;
> |    let d1 = dcreate nitems maxval and
> |        d2 = dcreate nitems maxval in
> |      let t1 = Sys.time () in
> |        dintersect d1 d2;
> |        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
> +----
> 
> I found that the dintersect function appears to be about 3-4 times
> faster than its Python equivalent, and about 10% faster than Java's
> native HashSet retainAll() method.
> 
> I wonder how to improve OCaml performance in this special case.  Is
> there a better suited OCaml data structure to simulate huge hashed
> sets of integers rather than by using OCaml's native Hashtbl?
> 
> Tibor
> -------------------
> To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
> Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
-- 
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|

Open PGP --- KeyID B1FFD916 (new key as of Dec 4th)
Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 11:52 ` Markus Mottl
@ 2002-08-27 12:09   ` Diego Olivier Fernandez Pons
  0 siblings, 0 replies; 8+ messages in thread
From: Diego Olivier Fernandez Pons @ 2002-08-27 12:09 UTC (permalink / raw)
  To: Markus Mottl; +Cc: caml-list

Markus Mottle a écrit :

> Yes, you'll probably want Patricia trees. You can get them from
> Jean-Christophe Filliatre's page:
> 
>   http://www.lri.fr/~filliatr/software.en.html
> 
> They support very fast set operations (union, difference, intersection).

Patricia Trees are tries in dimension 2.

In fact, one could try tries in dimension 30 with the leafs being
integers (a kind of mix between bitvectors and tries).

I have been doing a few test, a very simple implementation should be
avaible in the pre-release of the data structure library. But I don't
know yet if it is faster than Patricia trees or not. At least it
should save some space

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 12:06 ` Yaron M. Minsky
@ 2002-08-27 12:12   ` Markus Mottl
  2002-08-28 12:29     ` Tibor Simko
  0 siblings, 1 reply; 8+ messages in thread
From: Markus Mottl @ 2002-08-27 12:12 UTC (permalink / raw)
  To: Yaron M. Minsky; +Cc: Tibor Simko, caml-list

On Tue, 27 Aug 2002, Yaron M. Minsky wrote:
> If the sets are really as dense as you say they are (on the order of 50%
> of all possible elements), then the fastest thing is probably just to
> use a bool array.  Such an approach would require 3 passes:   one to set
> all the elements of the array to false.  One to mark the elements
> corresponding to set 1 to true, and then one pass over set 2 to see
> which of its elements are shared by 1.

If this is what is needed, better take bit vectors ("bitv" - also on
Jean-Christophe's page).

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 12:12   ` Markus Mottl
@ 2002-08-28 12:29     ` Tibor Simko
  0 siblings, 0 replies; 8+ messages in thread
From: Tibor Simko @ 2002-08-28 12:29 UTC (permalink / raw)
  To: Yaron M. Minsky; +Cc: caml-list

Hello

Thanks for all the suggestions.  Here's a little summary [figures
below obtained by studying some special cases]:

As for Ptset, I found that Patricia trees are good in sparse
situations only: here they may be about 2x faster than Hashtbl.
However, the situation to optimize is rather the dense set
intersection performance, as said in my previous example.  Here Ptset
often performs 3x slower than Hashtbl: even the ordinary Set module
intersection is often faster than Ptset's one.  Overall, having tried
several sparse-dense situations, I found that Hashtbl sets perform
much better than Ptset sets.

As for Bitv, the set operations are indeed very fast for dense sets,
often an order of magnitude faster than Hashtbl.  For sparse sets,
Hashtbl may be faster but Bitv performance is acceptable here.  And,
since marshaling of Bitv vectors is blazingly fast too (often two
order of magnitudes faster than Hashtbl), it looks like Bitv is the
ideal overall data structure for my problem. :-)

Tibor
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] intersecting huge integer sets
  2002-08-27 11:20 [Caml-list] intersecting huge integer sets Tibor Simko
  2002-08-27 11:52 ` Markus Mottl
  2002-08-27 12:06 ` Yaron M. Minsky
@ 2002-08-29 10:13 ` Diego Olivier Fernandez Pons
  2002-08-29 11:33   ` [Caml-list] barre dans le filtrage de motifs Diego Olivier Fernandez Pons
  2 siblings, 1 reply; 8+ messages in thread
From: Diego Olivier Fernandez Pons @ 2002-08-29 10:13 UTC (permalink / raw)
  To: Tibor Simko; +Cc: caml-list

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1098 bytes --]

    Bonjour,

I made a small module with tries in base n where n is inferior to the
native integer size. I supports big-endian tries and little-endian
tries (in fact, user defined decomposition)

avaible functions :
- insert (log n)
- delete (log n)
- from_list (n log n)
- to_list (unknown)
- union (shoud be n)
- intersection (should be n)

I do not know if they are faster or not than bit-vectors or patricia
trees, but they save a lot of space

A set of integers from 0 to 9999 (I tried with 1 000 000 items as you
suggested but my function never ended. I remember a computation by
Xavier Leroy showing that a counter could hardly overflow)

(results given by JC Fillâtre's module Size)

avl set : 200 000 bytes
red-black set : 160 012 bytes
30 bit trie big-endian : 5 524 bytes
31 bit trie big-endian : 5 312 bytes
30 bit trie little-endian : 15 020 bytes
31 bit trie little-endian : 16 016 bytes

The implementation is not speed optimized. In fact one could also save
some more space compacting bits for bases inferior to int_size / 2

        Diego Olivier

[-- Attachment #2: bitSet --]
[-- Type: APPLICATION/octet-stream, Size: 1178 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Caml-list] barre dans le filtrage de motifs
  2002-08-29 10:13 ` Diego Olivier Fernandez Pons
@ 2002-08-29 11:33   ` Diego Olivier Fernandez Pons
  0 siblings, 0 replies; 8+ messages in thread
From: Diego Olivier Fernandez Pons @ 2002-08-29 11:33 UTC (permalink / raw)
  To: caml-list

    Bonjour,

Andreas Rossberg m'a demandé de quand datait la barre optionnelle
devant le premier terme d'un filtrage de motif et je n'ai pas su
lui répondre.

let f = function
  | 0 -> 0
  | ...

Tout au plus ai-je pu constater que dans caml-light 0.74 elle ne
figurait pas encore. Le registre des version d'Objective Caml (fichier
Changes) ne semble pas l'indiquer non plus, ou du moins je ne l'ai pas
vu.

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2002-08-29 11:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-27 11:20 [Caml-list] intersecting huge integer sets Tibor Simko
2002-08-27 11:52 ` Markus Mottl
2002-08-27 12:09   ` Diego Olivier Fernandez Pons
2002-08-27 12:06 ` Yaron M. Minsky
2002-08-27 12:12   ` Markus Mottl
2002-08-28 12:29     ` Tibor Simko
2002-08-29 10:13 ` Diego Olivier Fernandez Pons
2002-08-29 11:33   ` [Caml-list] barre dans le filtrage de motifs Diego Olivier Fernandez Pons

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).