caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] what is the functional way to solve this problem?
@ 2003-10-02 12:31 Ram Bhamidipaty
  2003-10-02 14:53 ` Michal Moskal
  0 siblings, 1 reply; 6+ messages in thread
From: Ram Bhamidipaty @ 2003-10-02 12:31 UTC (permalink / raw)
  To: caml-list

Someone suggested that my previous question about dynamically resizing
arrays hinted that my solution may be going in a non-functional
direction. That might be true.

So here is the problem I am trying to solve. I would like to solve it
in a "functional way".

I want to create an in-memory representation of some file system data. 
The input data file has three different types of lines in the file:

1. starts with "R": R 0 <dirname>

2. starts with "D": D <dir_num> <parent_dir_num> <name>
   The <dir_num> associated with each directory are
   sequentially assigned starting at 1.

3. starts with "F": F <file_size> <name>

The first line will always be: "R 0 <dir_name>" where 0 is the
directory number for the top level directory. This purpose of this
line is provide the starting point for the data in the rest of the
file.

The D lines indicate directories. The dir_num is an integer that
uniquely identifies this particular directory. The parent_dir_num
integer is used to locate this directory relative to the other
directories in the data file.

The F lines indicate data for a single file. The majority of the
lines in the file should be F lines. A file listed on an F line
is in the directory indicated by the closest D line that came
earlier in the file.

Once all the data is read in I want to output: A list of the top 100 largest
files, A list of the top 100 directories that contain the largest fraction
of the total disk space used by all the files in the data file - and in this
case the file size for a directory does not include sub-directories. Eventually
I want to also categorize the data by user-id as well.

I have already written a python application that can read this data file and
generate the data I am looking for. I was not happy with the python solution
because it was not very fast. Even with using a heap to store the top 100
largest files I was not able to create a python solution that could beat
a "grep | sort" pipeline (on unix of course).

In the python solution the limiting factor was putting the individual
files into a heap and repeatedly calling delete_min on the heap to
remove the smallest files. Even though the unix pipe based solution
ended up sorting _all_ the files and the python solution was handling
a smaller set of data the unix pipe solution was still much faster. 
There is a thread in google about this experiment. Do a google groups
search for group:comp.lang.python + bhamidipaty + "looking for speed-up ideas".

The bottom line for the python solution was that the grep+sort
pipeline took about 8 seconds and the fastest I could get python was
around 16 seconds. Of course the unix pipe solution would not be able
to do the other analysis that I wanted.

My goal of implementing this in OCaml is to beat the grep+sort
combination. If I can create a solution that can output all the
date I want - in one pass - AND still be faster than the grep+sort
partial solution - _that_ would be cool!

Having said all that - I wanted to use an array to hold the data for
each directory. I hoped that using an array would be faster than a
hash table since I know that the directory numbers are assigned
sequentially.

Thanks for reading this rather long message and thank you for any
advice.
-Ram

---------------------------------
An example of the data file:

R 0 /usr
D 1 0 local
F 4095165 f1
D 2 1 bin
F 189408 f2
F 189445 f3
D 4 1 etc
F 3956 f4
D 5 1 info
F 2613 f5
F 50111 f6
D 6 1 lib
F 610422 f7
D 7 1 man
F 82097 f8
---------------------------------


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] what is the functional way to solve this problem?
  2003-10-02 12:31 [Caml-list] what is the functional way to solve this problem? Ram Bhamidipaty
@ 2003-10-02 14:53 ` Michal Moskal
  2003-10-08 14:48   ` Pierre Weis
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Moskal @ 2003-10-02 14:53 UTC (permalink / raw)
  To: Ram Bhamidipaty; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 2415 bytes --]

On Thu, Oct 02, 2003 at 05:31:22AM -0700, Ram Bhamidipaty wrote:
> Someone suggested that my previous question about dynamically resizing
> arrays hinted that my solution may be going in a non-functional
> direction. That might be true.
> 
> So here is the problem I am trying to solve. I would like to solve it
> in a "functional way".

Use Map. You won't notice a difference.

[...]
> I have already written a python application that can read this data file and
> generate the data I am looking for. I was not happy with the python solution
> because it was not very fast. Even with using a heap to store the top 100
> largest files I was not able to create a python solution that could beat
> a "grep | sort" pipeline (on unix of course).

[...]

./stats < /shm/d  0.11s user 0.00s system 78% cpu 0.139 total

grep '^F' /shm/d  0.02s user 0.00s system 22% cpu 0.090 total
sort -n -t' ' -k2,2  0.32s user 0.05s system 71% cpu 0.517 total
tail -100  0.02s user 0.00s system 3% cpu 0.514 total

 86287  263948 1912177 /shm/d

Did I win something? ;-)

[...]

> My goal of implementing this in OCaml is to beat the grep+sort
> combination. If I can create a solution that can output all the
> date I want - in one pass - AND still be faster than the grep+sort
> partial solution - _that_ would be cool!
> 
> Having said all that - I wanted to use an array to hold the data for
> each directory. I hoped that using an array would be faster than a
> hash table since I know that the directory numbers are assigned
> sequentially.

Map! Map! Stop thinking in such an imperative way ;-)

Key points of my implementation:

1. It also uses heap (simple one from pure-fun)
2. It is compiled with ocamlopt (ocamlc version is 6 times slower)
3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or
   more times to parse input then actual computations.
4. There is little trick, that heap holds 101 elements, and always
   discards last one, so we don't need to check if newly inserted
   element is really bigger then min we just deleted.
5. The code isn't very pretty (it would be better to make heap functor
   with item count, and so on). It is left as an exercise to the reader.

You're free to use it. I also attach program I used to create input.

-- 
: Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
: When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h

[-- Attachment #2: stats.ml --]
[-- Type: text/plain, Size: 4105 bytes --]

exception Empty

module Heap (Element : Map.OrderedType) =
struct
  module Elem = Element

  type heap = E | T of int * Elem.t * heap * heap

  let rank = function E -> 0 | T (r,_,_,_) -> r

  let makeT x a b =
    if rank a >= rank b then T (rank b + 1, x, a, b)
    else T (rank a + 1, x, b, a)

  let empty = E
  let is_empty h = h = E

  let rec merge h1 h2 = match h1, h2 with
    | _, E -> h1
    | E, _ -> h2
    | T (_, x, a1, b1), T (_, y, a2, b2) ->
        if Elem.compare x y <= 0 then makeT x a1 (merge b1 h2)
        else makeT y a2 (merge h1 b2)

  let insert x h = merge (T (1, x, E, E)) h
  let find_min = function E -> raise Empty | T (_, x, _, _) -> x
  let delete_min = function E -> raise Empty | T (_, x, a, b) -> merge a b

  let flatten x =
    let rec loop acc = function
      | E -> acc
      | T (_, x, a, b) -> loop (loop (x :: acc) a) b
    in loop [] x
end


module Dir = struct
  type t =
    {
      size : int;
      name : string;
      parent : t option;
      no : int;
    }

  let compare x y = x.size - y.size
end

module Int = struct
  type t = int
  let compare x y = x - y
end

module File = struct
  type t =
    {
      size : int;
      dir : Dir.t;
      name : string;
    }

  let compare x y = x.size - y.size
end

module Dir_set = Map.Make(Int)
module Dir_heap = Heap(Dir)
module File_heap = Heap(File)

type line =
  | D of int * int * string
  | F of int * string
  | Eof

let limit = 100

let read () =
  try
    let l = read_line () in
    let rec scan_int pos =
      match l.[pos] with
      | '0' .. '9' -> scan_int (pos + 1)
      | _ -> pos
    in
    let int_from pos =
      let xend = scan_int pos in
      (int_of_string (String.sub l pos (xend - pos)), xend + 1)
    in
    match l.[0] with
    | 'D' ->
      let (x, p) = int_from 2 in
      let (y, p) = int_from p in
      let s = String.sub l p (String.length l - p) in
      D (x, y, s)
    | 'F' -> 
      let (x, p) = int_from 2 in
      let s = String.sub l p (String.length l - p) in
      F (x, s)
    | _ -> invalid_arg "read"
  with End_of_file -> Eof

let rec loop dh fh ds cur_dir = 
  let push_dir () =
    let (dh_sz, dh) = dh in
    if dh_sz > limit then 
      (dh_sz, Dir_heap.insert cur_dir (Dir_heap.delete_min dh))
    else 
      (dh_sz + 1, Dir_heap.insert cur_dir dh)
  in
  
  function
  | Eof ->
    (push_dir (), fh)
  | D (no, par, nam) ->
    let cur_dir =
      {
        Dir.parent = Some (Dir_set.find par ds);
        Dir.name = nam;
        Dir.size = 0;
        Dir.no = no;
      }
    in loop (push_dir ()) fh (Dir_set.add cur_dir.Dir.no cur_dir ds) cur_dir (read ())
  | F (sz, name) ->
    let f = { File.size = sz; File.name = name; File.dir = cur_dir } in
    let fh =
      let (fh_sz, fh) = fh in
      if fh_sz > limit then 
        (fh_sz, File_heap.insert f (File_heap.delete_min fh))
      else 
        (fh_sz + 1, File_heap.insert f fh)
    in loop dh fh ds {cur_dir with Dir.size = cur_dir.Dir.size + sz} (read ())

let main =
  let start nam =
    let dir = { Dir.name = nam; Dir.size = 0; Dir.no = 0; Dir.parent = None } in
    let ds = Dir_set.add 0 dir Dir_set.empty in
    loop (0, Dir_heap.empty) (0, File_heap.empty) ds dir (read ()) 
  in
  let l = read_line () in
  let (dh, fh) = Scanf.sscanf l "R 0 %[^\n]" start in
  let dir_name d =
    let rec loop acc = function
      | { Dir.name = n; Dir.parent = None } -> n :: acc
      | { Dir.name = n; Dir.parent = Some p } -> loop (n :: acc) p
    in String.concat "/" (loop [] d)
  in
  let print_dir d = 
    Printf.printf "%10d %s\n" d.Dir.size (dir_name d) 
  in
  let print_file f = 
    Printf.printf "%10d %s/%s\n" f.File.size (dir_name f.File.dir) f.File.name
  in
  let dh =
    let (dh_sz, dh) = dh in
    if dh_sz > limit then Dir_heap.delete_min dh
    else dh
  in 
  Printf.printf "DIRS:\n";
  List.iter print_dir (List.sort Dir.compare (Dir_heap.flatten dh));
  let fh =
    let (fh_sz, fh) = fh in
    if fh_sz > limit then File_heap.delete_min fh
    else fh
  in 
  Printf.printf "FILES:\n";
  List.iter print_file (List.sort File.compare (File_heap.flatten fh));

[-- Attachment #3: mkd.ml --]
[-- Type: text/plain, Size: 695 bytes --]

let id = ref 0

let rec go dir_name par_id =
  Unix.chdir dir_name;
  incr id;
  let id = !id in
  Printf.printf "D %d %d %s\n" id par_id dir_name;
  let d = Unix.opendir "." in
  let dirs = ref [] in
  let rec loop () =
    let f = Unix.readdir d in
    let st = Unix.lstat f in
    let _ =
      match f.[0], st.Unix.st_kind with
      | '.', _ -> ()
      | _, Unix.S_DIR -> dirs := f :: !dirs
      | _, Unix.S_REG -> Printf.printf "F %d %s\n" st.Unix.st_size f
      | _ -> ()
    in loop ()
  in
  begin
    List.iter (fun f -> go f id) 
      (try
        loop ()
       with End_of_file -> Unix.closedir d; !dirs)
  end;
  Unix.chdir ".."

let _ =
  Printf.printf "R 0 .\n";
  go "." 0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] what is the functional way to solve this problem?
  2003-10-02 14:53 ` Michal Moskal
@ 2003-10-08 14:48   ` Pierre Weis
  2003-10-08 16:09     ` Michal Moskal
  0 siblings, 1 reply; 6+ messages in thread
From: Pierre Weis @ 2003-10-08 14:48 UTC (permalink / raw)
  To: Michal Moskal; +Cc: ramb, caml-list

Hi Michal,

[...]
> Key points of my implementation:
[...]
> 3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or
>    more times to parse input then actual computations.
[...]
> -- 
> : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
> : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h

I'm sorry to report that I was so puzzled by your number 3) key point,
that I tested your code and a ``purely scanf'' version to see this
``10 or more times to parse'' behaviour; I didn't notice any runtime
difference between the two (native code compiled) versions. Am I
missing something ?

Could you please try to use the following version of your read
function, and report the runtime difference between your hand written
code ?

(BTW, I compiled the 2 programs using ocamlopt -unsafe -inline 9)

let read () =
  try
    Scanf.bscanf Scanf.Scanning.stdib " %c" (function
    | 'D' ->
        Scanf.bscanf Scanf.Scanning.stdib " %d %d %s"
        (fun x y s -> D (x, y, s))
    | 'F' ->
        Scanf.bscanf Scanf.Scanning.stdib " %d %s"
        (fun x s -> F (x, s))
    | _ -> invalid_arg "read")
  with End_of_file -> Eof

You should also replace
(*
  let l = read_line () in
  let (dh, fh) = Scanf.sscanf l "R 0 %[^\n]" start in
*)
by

  let (dh, fh) = Scanf.bscanf Scanf.Scanning.stdib "R 0 %s@\n" start in

Looking forward for reading from you,

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] what is the functional way to solve this problem?
  2003-10-08 14:48   ` Pierre Weis
@ 2003-10-08 16:09     ` Michal Moskal
  2003-10-08 20:34       ` Pierre Weis
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Moskal @ 2003-10-08 16:09 UTC (permalink / raw)
  To: Pierre Weis; +Cc: ramb, caml-list

On Wed, Oct 08, 2003 at 04:48:09PM +0200, Pierre Weis wrote:
> Hi Michal,
> 
> [...]
> > Key points of my implementation:
> [...]
> > 3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or
> >    more times to parse input then actual computations.
> [...]
> > -- 
> > : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
> > : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h
> 
> I'm sorry to report that I was so puzzled by your number 3) key point,
> that I tested your code and a ``purely scanf'' version to see this
> ``10 or more times to parse'' behaviour; I didn't notice any runtime
> difference between the two (native code compiled) versions. Am I
> missing something ?

Hm, with your version it's only 3 times slower:

-rw-r--r--    1 malekith users     1909887 Oct  8 18:02 /shm/d2

./statsf < /shm/d2 > /dev/null  0.31s user 0.01s system 103% cpu 0.308
total
./stats < /shm/d2 > /dev/null  0.10s user 0.00s system 155% cpu 0.064
total

However, your code is not correct:

> Could you please try to use the following version of your read
> function, and report the runtime difference between your hand written
> code ?
> 
> (BTW, I compiled the 2 programs using ocamlopt -unsafe -inline 9)
> 
> let read () =
>   try
>     Scanf.bscanf Scanf.Scanning.stdib " %c" (function
>     | 'D' ->
>         Scanf.bscanf Scanf.Scanning.stdib " %d %d %s"
----------------------------------------------------^^

It should be [^\n], as filenames can contain spaces. In this case:

./statsf < /shm/d2 > /dev/null  1.56s user 0.04s system 96% cpu 1.654
total

Well, it's 15 times slower.

All tests were run on athlon 1.56GHz, under pld-linux, ocaml 3.07,
ocamlopt -unsafe -inline 9.

-- 
: Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
: When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] what is the functional way to solve this problem?
  2003-10-08 16:09     ` Michal Moskal
@ 2003-10-08 20:34       ` Pierre Weis
  2003-10-08 21:22         ` scanf (Re: [Caml-list] what is the functional way to solve this problem?) Michal Moskal
  0 siblings, 1 reply; 6+ messages in thread
From: Pierre Weis @ 2003-10-08 20:34 UTC (permalink / raw)
  To: Michal Moskal; +Cc: pierre.weis, ramb, caml-list

[...]
> Hm, with your version it's only 3 times slower:
[...]
> However, your code is not correct:
> >         Scanf.bscanf Scanf.Scanning.stdib " %d %d %s"
> ----------------------------------------------------^^
> 
> It should be [^\n], as filenames can contain spaces. In this case:

I did not know the complete specification :)

However, your proposed patch is rather inefficient:

> It should be [^\n], as filenames can contain spaces. In this case:
---------------^^^^^

You should prefer %s@\n, if efficiency is a concern (which since to be the
case :)

> Well, it's 15 times slower.

Could you please let me have access to the data to be able to test the
code without bothering you ?

Thank you very much indeed for your testbed case, since you exhibit
situations where there is room for improvement in the implementation
of scanf (or at least there is the necessity to explain what are the
efficient pattern constructs for Scanf).

> All tests were run on athlon 1.56GHz, under pld-linux, ocaml 3.07,
> ocamlopt -unsafe -inline 9.
> -- 
> : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
> : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h

Thanks for the precision. I use a (now old) 700MHZ pentium III, but
since we just compare the efficiency of different versions of the same
program, I think that the machine hardware kind is not critical.

Best regards,

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* scanf (Re: [Caml-list] what is the functional way to solve this problem?)
  2003-10-08 20:34       ` Pierre Weis
@ 2003-10-08 21:22         ` Michal Moskal
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Moskal @ 2003-10-08 21:22 UTC (permalink / raw)
  To: Pierre Weis; +Cc: caml-list

On Wed, Oct 08, 2003 at 10:34:40PM +0200, Pierre Weis wrote:
> [...]
> > Hm, with your version it's only 3 times slower:
> [...]
> > However, your code is not correct:
> > >         Scanf.bscanf Scanf.Scanning.stdib " %d %d %s"
> > ----------------------------------------------------^^
> > 
> > It should be [^\n], as filenames can contain spaces. In this case:
> 
> I did not know the complete specification :)

Neither did I ;-) I just generated filesystem data from my /usr/share
and there was few files that contained spaces in names. It simply more
reasonable to assume filenames can contain spaces then to assume they
can't. OTOH filenames can as well contain \n... But we are talking about
scanf efficiency, so original problem becomes somewhat irrelevant :-)

> However, your proposed patch is rather inefficient:
> 
> > It should be [^\n], as filenames can contain spaces. In this case:
> ---------------^^^^^
> 
> You should prefer %s@\n, if efficiency is a concern (which since to be the
> case :)

I admit I didn't go that throughly through scanf manual. And there
seems to be no indication about efficiency of both solutions. But as I
understand [] thing would involve creating some flag array for all 256
characters, for each end every scanf call, which has to be slow. Maybe
some caching (for cases where [] really needs to be used)?

> > Well, it's 15 times slower.
> 
> Could you please let me have access to the data to be able to test the
> code without bothering you ?

Sure, but along with stats.ml I posted mkd.ml (from make-data). You can
run it on your /usr or /usr/share redirecting input to file. If you need
*my* test case, please drop me an email.

> Thank you very much indeed for your testbed case, since you exhibit
> situations where there is room for improvement in the implementation
> of scanf (or at least there is the necessity to explain what are the
> efficient pattern constructs for Scanf).

I generally not use Scanf. This is because I use OCaml mostly for some
compilers, preprocessors etc, that need more sophisticated lexing. It is
however nice when using OCaml to do some simple and/or scripting tasks.
I used Scanf mostly for programming contest, where input data is
formated into sequence of numbers or some simple strings. And found it
to be fast enough (you know, this is programming contest, they time you,
so you have to be fast). But I used only %d and maybe %s, mainly
through:

  let get_int () = Scanf.scanf " %d" (fun x -> x)

Which was more natural to me (didn't still convert myself to
all-functional-style-I'm-using-only-Haskell :-)

-- 
: Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv
: When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2003-10-08 21:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-02 12:31 [Caml-list] what is the functional way to solve this problem? Ram Bhamidipaty
2003-10-02 14:53 ` Michal Moskal
2003-10-08 14:48   ` Pierre Weis
2003-10-08 16:09     ` Michal Moskal
2003-10-08 20:34       ` Pierre Weis
2003-10-08 21:22         ` scanf (Re: [Caml-list] what is the functional way to solve this problem?) Michal Moskal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).