* Re: [Caml-list] The invert Benchmark
2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
@ 2002-04-17 20:04 ` Xavier Leroy
2002-04-17 20:28 ` Nadji.Gauthier
2002-04-18 19:07 ` [Caml-list] " Christophe TROESTLER
2002-04-17 20:13 ` [Caml-list] " Remi VANICAT
` (2 subsequent siblings)
3 siblings, 2 replies; 9+ messages in thread
From: Xavier Leroy @ 2002-04-17 20:04 UTC (permalink / raw)
To: Christophe TROESTLER; +Cc: O'Caml Mailing List
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/). As you
> will notice the Caml code (even compiled) performs poorly.
... and is poorly written.
> I guess
> part of the problem is due to using Map when Hashtbl is more suited.
Maybe not. Actually, using Map isn't such a bad idea in this
program, because maps are already sorted.
> So I tried to rewrite the code using Hashtbl (attached to this mail).
> What I got some trouble to figure out is how to get a list of the keys
> where each of the keys appears only once.
A useful trick is to organize your hashtable differently: make it map
strings to string list ref, e.g. have only one binding for each key,
but a list of associated values.
> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?
Why, just look in the standard library, of course. It's called
String.concat, and is a lot more efficient than the atrocious
implementation found in the benchmark.
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Caml-list] The invert Benchmark
2002-04-17 20:04 ` Xavier Leroy
@ 2002-04-17 20:28 ` Nadji.Gauthier
2002-04-18 19:07 ` [Caml-list] " Christophe TROESTLER
1 sibling, 0 replies; 9+ messages in thread
From: Nadji.Gauthier @ 2002-04-17 20:28 UTC (permalink / raw)
To: Xavier Leroy; +Cc: Christophe TROESTLER, O'Caml Mailing List
May I propose this program ? It uses a list of strings attached to
a string in a hashtbl as proposed by Xavier Leroy, uses the very useful
Pcre binding of Markus Mottl, doesn't need a join facility, and is
slightly faster than the one in perl.
Of course, any comment from the gurus are appreciated ...
exception Impossible
let process_input () =
let h = Hashtbl.create 127 in
let foldadd k d =
try
let y = Hashtbl.find h k in
Hashtbl.replace h k (d::y)
with _ -> Hashtbl.add h k [d]
in
try
let r = Pcre.regexp "\t" in
while true do
let l = input_line stdin in
(match Pcre.split ~rex:r l with
[a;b] -> foldadd b a
| _ -> failwith "bad input")
done;
raise Impossible
with _ -> h
let post_process h =
let l = Hashtbl.fold (fun k d l -> (k, List.sort compare d)::l) h [] in
List.sort (fun (a,_) (b, _) -> compare a b) l
let pprint =
List.iter (fun (k, d) ->
Printf.printf "%s" k;
List.iter (fun s -> Printf.printf "\t%s" s) d;
Printf.printf "\n";
)
let _ =
let h = process_input () in
let l = post_process h in
pprint l
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Caml-list] Re: The invert Benchmark
2002-04-17 20:04 ` Xavier Leroy
2002-04-17 20:28 ` Nadji.Gauthier
@ 2002-04-18 19:07 ` Christophe TROESTLER
2002-04-18 19:21 ` Mattias Waldau
1 sibling, 1 reply; 9+ messages in thread
From: Christophe TROESTLER @ 2002-04-18 19:07 UTC (permalink / raw)
To: O'Caml Mailing List
First of all, many thanks to all who replied -- I really enjoy being
on this list!
On Wed, 17 Apr 2002, Xavier Leroy <xavier.leroy@inria.fr> wrote:
>
> Maybe not. Actually, using Map isn't such a bad idea in this
> program, because maps are already sorted.
Yes, but can we rely on that since the documentation says:
The order in which the associated values are passed
is unspecified.
Does it mean it can change any time in the future or is the library
basically going to stay that way?
Best regards,
ChriS
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [Caml-list] Re: The invert Benchmark
2002-04-18 19:07 ` [Caml-list] " Christophe TROESTLER
@ 2002-04-18 19:21 ` Mattias Waldau
0 siblings, 0 replies; 9+ messages in thread
From: Mattias Waldau @ 2002-04-18 19:21 UTC (permalink / raw)
To: 'Christophe TROESTLER', 'O'Caml Mailing List'
The use splay-trees instead:
val iter: (key -> 'a -> unit) -> 'a t -> unit
(** [iter s f] applies [f] in increasing order to all elements
of [s] *)
/mattias
> -----Original Message-----
> From: owner-caml-list@pauillac.inria.fr
> [mailto:owner-caml-list@pauillac.inria.fr] On Behalf Of
> Christophe TROESTLER
> Sent: Thursday, April 18, 2002 9:08 PM
> To: O'Caml Mailing List
> Subject: [Caml-list] Re: The invert Benchmark
>
>
> First of all, many thanks to all who replied -- I really
> enjoy being on this list!
>
> On Wed, 17 Apr 2002, Xavier Leroy <xavier.leroy@inria.fr> wrote:
> >
> > Maybe not. Actually, using Map isn't such a bad idea in
> this program,
> > because maps are already sorted.
>
> Yes, but can we rely on that since the documentation says:
>
> The order in which the associated values are passed
> is unspecified.
>
> Does it mean it can change any time in the future or is the
> library basically going to stay that way?
>
> Best regards,
> ChriS
> -------------------
> To unsubscribe, mail caml-list-request@inria.fr Archives:
http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/ Beginner's list:
http://groups.yahoo.com/group/ocaml_beginners
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Caml-list] The invert Benchmark
2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
2002-04-17 20:04 ` Xavier Leroy
@ 2002-04-17 20:13 ` Remi VANICAT
2002-04-17 20:30 ` Markus Mottl
2002-04-18 9:11 ` Jean-Christophe Filliatre
3 siblings, 0 replies; 9+ messages in thread
From: Remi VANICAT @ 2002-04-17 20:13 UTC (permalink / raw)
To: caml-list
Christophe TROESTLER <debian00@tiscalinet.be> writes:
> Dear Caml riders,
>
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/). As you
> will notice the Caml code (even compiled) performs poorly. I guess
> part of the problem is due to using Map when Hashtbl is more suited.
Not exactly. Map can be very well suited if one remember that Map sort
its element.
> So I tried to rewrite the code using Hashtbl (attached to this mail).
> What I got some trouble to figure out is how to get a list of the keys
> where each of the keys appears only once. I eventually went the easy
> way. Anybody got better ideas to improve efficiency? Could a "keys"
> function be an interesting addition to Hashtbl???
>
> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?
very poorly efficient, a far better implementation for it is :
let join c l = String.concat c l
one can, of course, use directly String.concat....
[...]
it's often better to use a hastable of list ref than what you are
doing (that is : first, check if the key is already associated with a
list ref, then if it is the case add your element to the list ref,
otherwise, add a new binding with an list ref containing only the new
element).
--
Rémi Vanicat
vanicat@labri.u-bordeaux.fr
http://dept-info.labri.u-bordeaux.fr/~vanicat
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Caml-list] The invert Benchmark
2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
2002-04-17 20:04 ` Xavier Leroy
2002-04-17 20:13 ` [Caml-list] " Remi VANICAT
@ 2002-04-17 20:30 ` Markus Mottl
2002-04-17 21:19 ` Mattias Waldau
2002-04-18 9:11 ` Jean-Christophe Filliatre
3 siblings, 1 reply; 9+ messages in thread
From: Markus Mottl @ 2002-04-17 20:30 UTC (permalink / raw)
To: Christophe TROESTLER; +Cc: O'Caml Mailing List
On Wed, 17 Apr 2002, Christophe TROESTLER wrote:
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/). As you
> will notice the Caml code (even compiled) performs poorly.
Looking at the code this does not surprise me...
> Anybody got better ideas to improve efficiency? Could a "keys"
> function be an interesting addition to Hashtbl???
This should be about 10x faster than the initial version:
---------------------------------------------------------------------------
let report_url url = print_string ("\t" ^ url)
let report (file, url_lst) =
let url_ar = Array.of_list url_lst in
Array.sort compare url_ar;
print_string file; Array.iter report_url url_ar; print_char '\n'
let _ =
let table = Hashtbl.create 10000 in
let cnt_ref = ref 0 in
try
while true do
let line = input_line stdin in
let ix = String.index line '\t' in
let url = String.sub line 0 ix in
let file = String.sub line (ix + 1) (String.length line - ix - 1) in
try
let urls_ref = Hashtbl.find table file in
urls_ref := url :: !urls_ref
with Not_found -> incr cnt_ref; Hashtbl.add table file (ref [url])
done
with
| End_of_file ->
let ar = Array.make !cnt_ref ("", []) in
let coll file urls_ref cnt = ar.(cnt) <- file, !urls_ref; cnt - 1 in
ignore (Hashtbl.fold coll table (!cnt_ref - 1));
Array.sort compare ar; Array.iter report ar
| Not_found -> failwith "bad data"
---------------------------------------------------------------------------
> Another related question that popped up is: how to _efficiently_
> implement a join operation (join : string -> string list -> string is
> defined by: join c [s1;...;sn] = s1 ^ c ^ ... ^ c ^ sn) ?
Look up the function "String.concat" in the standard library...
Regards,
Markus Mottl
--
Markus Mottl markus@oefai.at
Austrian Research Institute
for Artificial Intelligence http://www.oefai.at/~markus
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Caml-list] The invert Benchmark
2002-04-17 18:26 [Caml-list] The invert Benchmark Christophe TROESTLER
` (2 preceding siblings ...)
2002-04-17 20:30 ` Markus Mottl
@ 2002-04-18 9:11 ` Jean-Christophe Filliatre
3 siblings, 0 replies; 9+ messages in thread
From: Jean-Christophe Filliatre @ 2002-04-18 9:11 UTC (permalink / raw)
To: Christophe TROESTLER
Christophe TROESTLER writes:
> Dear Caml riders,
>
> I found by chance the "The invert Benchmark"
> (http://www.lib.uchicago.edu/keith/crisis/benchmarks/invert/). As you
> will notice the Caml code (even compiled) performs poorly. I guess
> part of the problem is due to using Map when Hashtbl is more suited.
> So I tried to rewrite the code using Hashtbl (attached to this mail).
> What I got some trouble to figure out is how to get a list of the keys
> where each of the keys appears only once. I eventually went the easy
> way. Anybody got better ideas to improve efficiency? Could a "keys"
> function be an interesting addition to Hashtbl???
As suggested by Xavier, String.concat substituted for the join
function really helps.
A simple profiling with gprof also showed that both in your code and
in the original one, a lot of time was spent building regexps!
Factorizing out Str.{regexp,regexp_string} "\t" (i.e. doing it only
once in a global variable) really improves performances.
Anyway, using Str.split to cut a string in two is a bit costly; doing
it directly as Markus did also improves performances.
--
Jean-Christophe Filliâtre (http://www.lri.fr/~filliatr)
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 9+ messages in thread