caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] The invert Benchmark
@ 2002-04-17 18:26 Christophe TROESTLER
  2002-04-17 20:04 ` Xavier Leroy
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Christophe TROESTLER @ 2002-04-17 18:26 UTC (permalink / raw)
  To: O'Caml Mailing List

[-- Attachment #1: Type: Text/Plain, Size: 812 bytes --]

Dear Caml riders,

I found by chance the "The invert Benchmark"
(http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/).  As you
will notice the Caml code (even compiled) performs poorly.  I guess
part of the problem is due to using Map when Hashtbl is more suited.
So I tried to rewrite the code using Hashtbl (attached to this mail).
What I got some trouble to figure out is how to get a list of the keys
where each of the keys appears only once.  I eventually went the easy
way.  Anybody got better ideas to improve efficiency?  Could a "keys"
function be an interesting addition to Hashtbl???

Another related question that popped up is: how to _efficiently_
implement a join operation (join : string -> string list -> string is
defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?

Cheers,
ChriS

[-- Attachment #2: Code for "The invert Benchmark" --]
[-- Type: Text/Plain, Size: 857 bytes --]

open Printf

let process_input init_size =
  let table = Hashtbl.create init_size
  and split line = match Str.split (Str.regexp_string "\t") line with
    | [a; b] -> (b, a)
    | _ -> failwith "Bad file format" in
  let rec loop() =
    let line = read_line() in 
    let b, a = split line in (Hashtbl.add table b a;  loop())
  in try
      loop()
    with
      | End_of_file -> table


let get_keys t = 
  let module S = Set.Make (struct type t = string 
				  let compare = compare 
			   end) in
    S.elements(Hashtbl.fold (fun k d l -> S.add k l) t S.empty)

let print table =
  let keys = List.sort compare (get_keys table) in
    List.iter (fun b ->
 		 printf "%s" b;
		 let l = List.sort compare (Hashtbl.find_all table b) in
 		   List.iter (fun s -> printf "\t%s" s) l;
		   print_newline()
	      ) keys

let () =
  print(process_input 2000)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] The invert Benchmark
  2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
@ 2002-04-17 20:04 ` Xavier Leroy
  2002-04-17 20:28   ` Nadji.Gauthier
  2002-04-18 19:07   ` [Caml-list] " Christophe TROESTLER
  2002-04-17 20:13 ` [Caml-list] " Remi VANICAT
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: Xavier Leroy @ 2002-04-17 20:04 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: O'Caml Mailing List

> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/).  As you
> will notice the Caml code (even compiled) performs poorly. 

... and is poorly written.

> I guess
> part of the problem is due to using Map when Hashtbl is more suited.

Maybe not.  Actually, using Map isn't such a bad idea in this
program, because maps are already sorted.

> So I tried to rewrite the code using Hashtbl (attached to this mail).
> What I got some trouble to figure out is how to get a list of the keys
> where each of the keys appears only once.

A useful trick is to organize your hashtable differently: make it map
strings to string list ref, e.g. have only one binding for each key,
but a list of associated values.

> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?

Why, just look in the standard library, of course.  It's called 
String.concat, and is a lot more efficient than the atrocious
implementation found in the benchmark.

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] The invert Benchmark
  2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
  2002-04-17 20:04 ` Xavier Leroy
@ 2002-04-17 20:13 ` Remi VANICAT
  2002-04-17 20:30 ` Markus Mottl
  2002-04-18  9:11 ` Jean-Christophe Filliatre
  3 siblings, 0 replies; 9+ messages in thread
From: Remi VANICAT @ 2002-04-17 20:13 UTC (permalink / raw)
  To: caml-list

Christophe TROESTLER <debian00@tiscalinet.be> writes:

> Dear Caml riders,
> 
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/).  As you
> will notice the Caml code (even compiled) performs poorly.  I guess
> part of the problem is due to using Map when Hashtbl is more suited.

Not exactly. Map can be very well suited if one remember that Map sort
its element. 

> So I tried to rewrite the code using Hashtbl (attached to this mail).
> What I got some trouble to figure out is how to get a list of the keys
> where each of the keys appears only once.  I eventually went the easy
> way.  Anybody got better ideas to improve efficiency?  Could a "keys"
> function be an interesting addition to Hashtbl???
> 
> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?

very poorly efficient, a far better implementation for it is :

let join c l = String.concat c l

one can, of course, use directly String.concat....

[...]

it's often better to use a hastable of list ref than what you are
doing (that is : first, check if the key is already associated with a
list ref, then if it is the case add your element to the list ref,
otherwise, add a new binding with an list ref containing only the new
element). 
-- 
Rémi Vanicat
vanicat@labri.u-bordeaux.fr
http://dept-info.labri.u-bordeaux.fr/~vanicat
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] The invert Benchmark
  2002-04-17 20:04 ` Xavier Leroy
@ 2002-04-17 20:28   ` Nadji.Gauthier
  2002-04-18 19:07   ` [Caml-list] " Christophe TROESTLER
  1 sibling, 0 replies; 9+ messages in thread
From: Nadji.Gauthier @ 2002-04-17 20:28 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: Christophe TROESTLER, O'Caml Mailing List

May I propose this program ? It uses a list of strings attached to
a string in a hashtbl as proposed by Xavier Leroy, uses the very useful
Pcre binding of Markus Mottl, doesn't need a join facility, and is
slightly faster than the one in perl.
Of course, any comment from the gurus are appreciated ...

exception Impossible

let process_input () = 
  let h = Hashtbl.create 127 in
  let foldadd k d =
    try 
      let y = Hashtbl.find h k in
      Hashtbl.replace h k (d::y)
    with _ -> Hashtbl.add h k [d]
  in    
  try
    let r = Pcre.regexp "\t" in
    while true do
      let l = input_line stdin in
      (match Pcre.split ~rex:r l with
	 [a;b] -> foldadd b a
       | _ -> failwith "bad input")
    done;
    raise Impossible
  with _ -> h

let post_process h = 
  let l = Hashtbl.fold (fun k d l -> (k, List.sort compare d)::l) h [] in
  List.sort (fun (a,_) (b, _) -> compare a b) l

let pprint = 
  List.iter (fun (k, d) -> 
	       Printf.printf "%s" k; 
	       List.iter (fun s -> Printf.printf "\t%s" s) d;
	       Printf.printf "\n";
	    )

let _ =
  let h = process_input () in
  let l = post_process h in
  pprint l
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] The invert Benchmark
  2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
  2002-04-17 20:04 ` Xavier Leroy
  2002-04-17 20:13 ` [Caml-list] " Remi VANICAT
@ 2002-04-17 20:30 ` Markus Mottl
  2002-04-17 21:19   ` Mattias Waldau
  2002-04-18  9:11 ` Jean-Christophe Filliatre
  3 siblings, 1 reply; 9+ messages in thread
From: Markus Mottl @ 2002-04-17 20:30 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: O'Caml Mailing List

On Wed, 17 Apr 2002, Christophe TROESTLER wrote:
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/).  As you
> will notice the Caml code (even compiled) performs poorly.

Looking at the code this does not surprise me...

> Anybody got better ideas to improve efficiency?  Could a "keys"
> function be an interesting addition to Hashtbl???

This should be about 10x faster than the initial version:

---------------------------------------------------------------------------
let report_url url = print_string ("\t" ^ url)

let report (file, url_lst) =
  let url_ar = Array.of_list url_lst in
  Array.sort compare url_ar;
  print_string file; Array.iter report_url url_ar; print_char '\n'

let _ =
  let table = Hashtbl.create 10000 in
  let cnt_ref = ref 0 in
  try
    while true do
      let line = input_line stdin in
      let ix = String.index line '\t' in
      let url = String.sub line 0 ix in
      let file = String.sub line (ix + 1) (String.length line - ix - 1) in
      try
        let urls_ref = Hashtbl.find table file in
        urls_ref := url :: !urls_ref
      with Not_found -> incr cnt_ref; Hashtbl.add table file (ref [url])
    done
  with
  | End_of_file ->
    let ar = Array.make !cnt_ref ("", []) in
    let coll file urls_ref cnt = ar.(cnt) <- file, !urls_ref; cnt - 1 in
    ignore (Hashtbl.fold coll table (!cnt_ref - 1));
    Array.sort compare ar; Array.iter report ar
  | Not_found -> failwith "bad data"
---------------------------------------------------------------------------

> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?

Look up the function "String.concat" in the standard library...

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [Caml-list] The invert Benchmark
  2002-04-17 20:30 ` Markus Mottl
@ 2002-04-17 21:19   ` Mattias Waldau
  0 siblings, 0 replies; 9+ messages in thread
From: Mattias Waldau @ 2002-04-17 21:19 UTC (permalink / raw)
  To: 'Markus Mottl', 'Christophe TROESTLER'
  Cc: 'O'Caml Mailing List'

I followed his Python-code, which was much better than
his Ocaml-code.

/mattias

module Ord_string =
  struct
    type t = string
    let compare = compare
  end ;;


module S = Splay.Make(Ord_string) ;;

let tab_pattern = (Str.regexp_string "\t") in
let b = S.create () in
try
  while true do
    let line = input_line stdin in
    match Str.split tab_pattern line with
    | v :: k :: _ ->
	(try
	  S.set b k (v :: (S.find b k))
	with Not_found ->
	  S.set b k [v])
    | _ -> failwith "Not a pair"
  done
with End_of_file -> ();
  S.iter (fun k vs -> 
    let vs = List.sort compare vs in
    Printf.printf "%s\t%s\n" k (String.concat "\t" vs))
    b ;;




-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] The invert Benchmark
  2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
                   ` (2 preceding siblings ...)
  2002-04-17 20:30 ` Markus Mottl
@ 2002-04-18  9:11 ` Jean-Christophe Filliatre
  3 siblings, 0 replies; 9+ messages in thread
From: Jean-Christophe Filliatre @ 2002-04-18  9:11 UTC (permalink / raw)
  To: Christophe TROESTLER


Christophe TROESTLER writes:
 > Dear Caml riders,
 > 
 > I found by chance the "The invert Benchmark"
 > (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/).  As you
 > will notice the Caml code (even compiled) performs poorly.  I guess
 > part of the problem is due to using Map when Hashtbl is more suited.
 > So I tried to rewrite the code using Hashtbl (attached to this mail).
 > What I got some trouble to figure out is how to get a list of the keys
 > where each of the keys appears only once.  I eventually went the easy
 > way.  Anybody got better ideas to improve efficiency?  Could a "keys"
 > function be an interesting addition to Hashtbl???

As  suggested  by  Xavier,  String.concat  substituted  for  the  join
function really helps.

A simple profiling  with gprof also showed that both  in your code and
in  the  original one,  a  lot of  time  was  spent building  regexps!
Factorizing  out Str.{regexp,regexp_string} "\t"  (i.e. doing  it only
once in a global variable) really improves performances.

Anyway, using Str.split to cut a  string in two is a bit costly; doing
it directly as Markus did also improves performances.

-- 
Jean-Christophe Filliâtre (http://www.lri.fr/~filliatr)

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Caml-list] Re: The invert Benchmark
  2002-04-17 20:04 ` Xavier Leroy
  2002-04-17 20:28   ` Nadji.Gauthier
@ 2002-04-18 19:07   ` Christophe TROESTLER
  2002-04-18 19:21     ` Mattias Waldau
  1 sibling, 1 reply; 9+ messages in thread
From: Christophe TROESTLER @ 2002-04-18 19:07 UTC (permalink / raw)
  To: O'Caml Mailing List

First of all, many thanks to all who replied -- I really enjoy being
on this list!

On Wed, 17 Apr 2002, Xavier Leroy <xavier.leroy@inria.fr> wrote:
> 
> Maybe not.  Actually, using Map isn't such a bad idea in this
> program, because maps are already sorted.

Yes, but can we rely on that since the documentation says:

	The order in which the associated values are passed
	is unspecified.

Does it mean it can change any time in the future or is the library
basically going to stay that way?

Best regards,
ChriS
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [Caml-list] Re: The invert Benchmark
  2002-04-18 19:07   ` [Caml-list] " Christophe TROESTLER
@ 2002-04-18 19:21     ` Mattias Waldau
  0 siblings, 0 replies; 9+ messages in thread
From: Mattias Waldau @ 2002-04-18 19:21 UTC (permalink / raw)
  To: 'Christophe TROESTLER', 'O'Caml Mailing List'

The use splay-trees instead:

    val iter: (key -> 'a -> unit) -> 'a t -> unit
        (** [iter s f] applies [f] in increasing order to all elements
of [s] *)

/mattias

> -----Original Message-----
> From: owner-caml-list@pauillac.inria.fr 
> [mailto:owner-caml-list@pauillac.inria.fr] On Behalf Of 
> Christophe TROESTLER
> Sent: Thursday, April 18, 2002 9:08 PM
> To: O'Caml Mailing List
> Subject: [Caml-list] Re: The invert Benchmark
> 
> 
> First of all, many thanks to all who replied -- I really 
> enjoy being on this list!
> 
> On Wed, 17 Apr 2002, Xavier Leroy <xavier.leroy@inria.fr> wrote:
> > 
> > Maybe not.  Actually, using Map isn't such a bad idea in 
> this program, 
> > because maps are already sorted.
> 
> Yes, but can we rely on that since the documentation says:
> 
> 	The order in which the associated values are passed
> 	is unspecified.
> 
> Does it mean it can change any time in the future or is the 
> library basically going to stay that way?
> 
> Best regards,
> ChriS
> -------------------
> To unsubscribe, mail caml-list-request@inria.fr Archives: 
http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/ Beginner's list:
http://groups.yahoo.com/group/ocaml_beginners

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2002-04-18 21:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
2002-04-17 20:04 ` Xavier Leroy
2002-04-17 20:28   ` Nadji.Gauthier
2002-04-18 19:07   ` [Caml-list] " Christophe TROESTLER
2002-04-18 19:21     ` Mattias Waldau
2002-04-17 20:13 ` [Caml-list] " Remi VANICAT
2002-04-17 20:30 ` Markus Mottl
2002-04-17 21:19   ` Mattias Waldau
2002-04-18  9:11 ` Jean-Christophe Filliatre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).