caml-list - the Caml user's mailing list
* zcat vs CamlZip
@ 2006-08-29 18:40 Sam Steingold
  2006-08-29 18:54 ` Bardur Arantsson
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Sam Steingold @ 2006-08-29 18:40 UTC (permalink / raw)
  To: caml-list

I read through a huge *.gz file.
I have two versions of the code:

1. use Unix.open_process_in "zcat foo.gz".

2. use gzip.mli (1.2 2002/02/18) as comes with godi 3.09.

It turns out that the zcat version is about 3(!) times as fast as the 
gzip.mli one (gzip.mli report first, zcat report second):

Run time: 189.435840 sec
Self:     189.435840 sec
      sys: 183.447465 sec
     user: 5.988375 sec
Children: 0.000000 sec
      sys: 0.000000 sec
     user: 0.000000 sec
GC:     minor: 169778
         major: 478
   compactions: 3
Allocated:  5510457762.0 words
Wall clock:  206 sec (00:03:26)

vs

Run time: 58.471655 sec
Self:     54.855429 sec
      sys: 48.527033 sec
     user: 6.328396 sec
Children: 3.616226 sec
      sys: 3.168198 sec
     user: 0.448028 sec
GC:     minor: 43174
         major: 229
   compactions: 5
Allocated:  1401290543.0 words
Wall clock:  78 sec (00:01:18)
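
To make the "3 times" figure concrete, here is a small sketch that recomputes the ratios from the two reports above. The record type and its field names are invented for this illustration. The first report is taken to be the gzip.mli run because it has no child process, and the zcat figures add the self and children times together.

```ocaml
(* Figures copied from the two reports above; the record type and field
   names are made up for this illustration. *)
type report = { run : float; sys : float; user : float;
                wall : float; words : float }

(* First report: no child process, so this is the gzip.mli run. *)
let gzip_mli =
  { run = 189.435840; sys = 183.447465; user = 5.988375;
    wall = 206.0; words = 5510457762.0 }

(* Second report: self and children (the zcat process) added together. *)
let zcat =
  { run = 58.471655; sys = 48.527033 +. 3.168198;
    user = 6.328396 +. 0.448028;
    wall = 78.0; words = 1401290543.0 }

let ratio a b = a /. b
let percent part whole = 100.0 *. part /. whole

let () =
  Printf.printf "CPU time ratio:    %.2f\n" (ratio gzip_mli.run zcat.run);
  Printf.printf "wall clock ratio:  %.2f\n" (ratio gzip_mli.wall zcat.wall);
  Printf.printf "allocation ratio:  %.2f\n" (ratio gzip_mli.words zcat.words);
  Printf.printf "user time:         %.2f s vs %.2f s\n" gzip_mli.user zcat.user;
  Printf.printf "share of sys time: %.1f%% vs %.1f%%\n"
    (percent gzip_mli.sys gzip_mli.run) (percent zcat.sys zcat.run)
```

The CPU time ratio comes out a little above 3 and the wall clock ratio somewhat below it. User time is roughly 6 to 7 seconds in both runs, so nearly all of the gap is system time; the reports alone do not say what causes it.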

Since gzip.mli lacks an input_line function, I had to roll my own:

let buf = Buffer.create 1024
let gz_input_line gz_in char_counter line_counter =
   Buffer.clear buf;
   let finish () = incr line_counter; Buffer.contents buf in
   let rec loop () =
     let ch = Gzip.input_char gz_in in
     char_counter := Int64.succ !char_counter;
     if ch = '\n' then finish () else ( Buffer.add_char buf ch; loop () ) in
   try loop ()
   with End_of_file ->
     if Buffer.length buf = 0 then raise End_of_file else finish ()

Is there something wrong with my gz_input_line?
Is this a known performance issue with the CamlZip library?

thanks.
Sam.



* Re: zcat vs CamlZip
  2006-08-29 18:40 zcat vs CamlZip Sam Steingold
@ 2006-08-29 18:54 ` Bardur Arantsson
  2006-08-29 19:01   ` [Caml-list] " Florian Hars
                     ` (2 more replies)
  2006-08-29 19:11 ` [Caml-list] " Eric Cooper
  2006-08-30  6:12 ` Jeff Henrikson
  2 siblings, 3 replies; 12+ messages in thread
From: Bardur Arantsson @ 2006-08-29 18:54 UTC (permalink / raw)
  To: caml-list

Sam Steingold wrote:
> I read through a huge *.gz file.
> I have two versions of the code:
[--snip--]
> 
> let buf = Buffer.create 1024
> let gz_input_line gz_in char_counter line_counter =
>   Buffer.clear buf;
>   let finish () = incr line_counter; Buffer.contents buf in
>   let rec loop () =
>     let ch = Gzip.input_char gz_in in

This is your most likely culprit. Any kind of "do this for every 
character" is usually insanely expensive when you can do it in bulk.
(This is especially true when needing to do system calls, or if the 
called function cannot be inlined.)
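
A rough way to see the per-character overhead is to count how often the underlying read function is called. The sketch below uses only the standard library; make_reader is a hypothetical stand-in for a decompressor (it is not part of CamlZip) that serves bytes from a string and counts its own calls.

```ocaml
(* A fake "decompressor": serves bytes from a string and counts how many
   times it is called. [read dst pos len] copies up to [len] bytes into
   [dst] and returns the number copied, 0 at end of input. *)
let make_reader (data : string) =
  let offset = ref 0 and calls = ref 0 in
  let read dst pos len =
    incr calls;
    let n = min len (String.length data - !offset) in
    Bytes.blit_string data !offset dst pos n;
    offset := !offset + n;
    n
  in
  (read, calls)

(* Count newlines, asking the reader for at most [chunk] bytes per call. *)
let count_newlines ~chunk read =
  let buf = Bytes.create chunk in
  let rec loop acc =
    match read buf 0 chunk with
    | 0 -> acc
    | n ->
        let acc = ref acc in
        for i = 0 to n - 1 do
          if Bytes.unsafe_get buf i = '\n' then incr acc
        done;
        loop !acc
  in
  loop 0

let data =
  String.concat "" (List.init 1000 (fun i -> Printf.sprintf "line %d\n" i))

let () =
  let read1, calls1 = make_reader data in
  let read4k, calls4k = make_reader data in
  let n1 = count_newlines ~chunk:1 read1 in
  let n4k = count_newlines ~chunk:4096 read4k in
  Printf.printf "lines: %d and %d; calls: per-char=%d, 4096-byte blocks=%d\n"
    n1 n4k !calls1 !calls4k
```

Both variants see the same lines, but the one-byte variant needs one call per input byte, while the block variant needs only a handful. How expensive each call is depends on the real library, which this sketch does not measure.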

-- 
Bardur Arantsson
<bardurREMOVE@THISscientician.net>

If you can't join 'em, beat 'em. Preferably with a big stick.



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-29 18:54 ` Bardur Arantsson
@ 2006-08-29 19:01   ` Florian Hars
  2006-08-29 19:15   ` Sam Steingold
  2006-08-29 19:37   ` John Carr
  2 siblings, 0 replies; 12+ messages in thread
From: Florian Hars @ 2006-08-29 19:01 UTC (permalink / raw)
  To: Bardur Arantsson; +Cc: caml-list

Bardur Arantsson wrote:
> Sam Steingold wrote:
>>     let ch = Gzip.input_char gz_in in
> 
> This is your most likely culprit.

Apart from that, zcat is in fact at least twice as fast
as the OCaml gzip module.

Yours, Florian.



* Re: [Caml-list] zcat vs CamlZip
  2006-08-29 18:40 zcat vs CamlZip Sam Steingold
  2006-08-29 18:54 ` Bardur Arantsson
@ 2006-08-29 19:11 ` Eric Cooper
  2006-08-30  6:12 ` Jeff Henrikson
  2 siblings, 0 replies; 12+ messages in thread
From: Eric Cooper @ 2006-08-29 19:11 UTC (permalink / raw)
  To: caml-list

On Tue, Aug 29, 2006 at 02:40:23PM -0400, Sam Steingold wrote:
> is this a known performance issue with the CamlZip library?

I found the same thing when I was writing approx, so I use a "gunzip"
process with Sys.command.  (You can also use open_process_in, but I
just decompress to a temporary file and then reread it.  That also
catches corrupt .gz files in a more robust way.)
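
A minimal sketch of the temporary-file approach described above, using only the standard library. The helper names (gunzip_command, with_gunzipped, count_lines) are made up for this example, and it assumes a Unix-like shell with gunzip on the PATH; it is not the code used in approx.

```ocaml
(* Build the shell command; Filename.quote protects unusual file names. *)
let gunzip_command ~src ~dst =
  Printf.sprintf "gunzip -c %s > %s" (Filename.quote src) (Filename.quote dst)

(* Decompress [src] into a temporary file, apply [f] to an input channel
   on it, and always remove the temporary file afterwards. A non-zero
   exit status from gunzip (for example on a corrupt file) is reported
   as a Failure before any data is read. *)
let with_gunzipped src f =
  let tmp = Filename.temp_file "gunzip" ".txt" in
  Fun.protect
    ~finally:(fun () -> try Sys.remove tmp with Sys_error _ -> ())
    (fun () ->
      match Sys.command (gunzip_command ~src ~dst:tmp) with
      | 0 ->
          let ic = open_in_bin tmp in
          Fun.protect ~finally:(fun () -> close_in_noerr ic) (fun () -> f ic)
      | code ->
          failwith
            (Printf.sprintf "gunzip failed on %s (exit code %d)" src code))

(* Count lines with the standard library's input_line. *)
let count_lines ic =
  let n = ref 0 in
  (try
     while true do
       ignore (input_line ic);
       incr n
     done
   with End_of_file -> ());
  !n
```

A call such as `with_gunzipped "foo.gz" count_lines` would then read the decompressed data with the ordinary, buffered input_line, at the cost of extra disk space for the temporary file.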

-- 
Eric Cooper             e c c @ c m u . e d u



* Re: zcat vs CamlZip
  2006-08-29 18:54 ` Bardur Arantsson
  2006-08-29 19:01   ` [Caml-list] " Florian Hars
@ 2006-08-29 19:15   ` Sam Steingold
  2006-08-29 19:48     ` Bárður Árantsson
                       ` (2 more replies)
  2006-08-29 19:37   ` John Carr
  2 siblings, 3 replies; 12+ messages in thread
From: Sam Steingold @ 2006-08-29 19:15 UTC (permalink / raw)
  To: Bardur Arantsson, caml-list

Bardur Arantsson wrote:
> Sam Steingold wrote:
>> I read through a huge *.gz file.
>> I have two versions of the code:
> [--snip--]
>>
>> let buf = Buffer.create 1024
>> let gz_input_line gz_in char_counter line_counter =
>>   Buffer.clear buf;
>>   let finish () = incr line_counter; Buffer.contents buf in
>>   let rec loop () =
>>     let ch = Gzip.input_char gz_in in
> 
> This is your most likely culprit. Any kind of "do this for every 
> character" is usually insanely expensive when you can do it in bulk.
> (This is especially true when needing to do system calls, or if the 
> called function cannot be inlined.)
> 

Yes, I thought about it, but I assumed that the OCaml gzip module 
inlines Gzip.input_char (obviously the gzip module needs an internal 
cache, so Gzip.input_char does not _always_ translate to a system call; 
most of the time it just pops a char from the internal buffer).
At any rate, do you really expect that using Gzip.input and then 
searching the result for a newline, slicing and dicing to get the 
individual input lines, etc., would be faster?

Sam.



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-29 18:54 ` Bardur Arantsson
  2006-08-29 19:01   ` [Caml-list] " Florian Hars
  2006-08-29 19:15   ` Sam Steingold
@ 2006-08-29 19:37   ` John Carr
  2 siblings, 0 replies; 12+ messages in thread
From: John Carr @ 2006-08-29 19:37 UTC (permalink / raw)
  To: caml-list


> This is your most likely culprit. Any kind of "do this for every 
> character" is usually insanely expensive when you can do it in bulk.

I wrote a program that read data from a text file, which
could optionally be compressed.  I defined my text file
format to have nearly-fixed length lines so I could call
Gzip.really_input.  My program doesn't spend much of its
time reading the text file so I didn't spend much time
making input fast.  I just did what I thought the obvious
optimization of reading a block of characters in the
normal case.

let input_line =
  begin function
    | Uncompressed c ->
        input_line c
    | Compressed c ->
        begin match Gzip.input_char c with
        | '#' -> while Gzip.input_char c <> '\n' do () done; "#"
        | 'S' ->
            let buf = String.make 11 'S' in
            Gzip.really_input c buf 1 10;
            if String.unsafe_get buf 10 = '\n' then
              String.unsafe_set buf 10 ' '
            else begin
              if Gzip.input_char c <> '\n' then
                failwith "bad override file"
            end;
            buf
        | _ -> failwith "bad override file"
        end
  end

(Lines are variable-length comments beginning '#' or data
lines beginning with 'S' followed by 9 or 10 characters.)



* Re: zcat vs CamlZip
  2006-08-29 19:15   ` Sam Steingold
@ 2006-08-29 19:48     ` Bárður Árantsson
  2006-08-29 19:54     ` [Caml-list] " Gerd Stolpmann
  2006-08-29 20:04     ` Gerd Stolpmann
  2 siblings, 0 replies; 12+ messages in thread
From: Bárður Árantsson @ 2006-08-29 19:48 UTC (permalink / raw)
  To: caml-list

Sam Steingold wrote:
> Bardur Arantsson wrote:
>> Sam Steingold wrote:
>>> I read through a huge *.gz file.
>>> I have two versions of the code:
>> [--snip--]
>>>
>>> let buf = Buffer.create 1024
>>> let gz_input_line gz_in char_counter line_counter =
>>>   Buffer.clear buf;
>>>   let finish () = incr line_counter; Buffer.contents buf in
>>>   let rec loop () =
>>>     let ch = Gzip.input_char gz_in in
>>
>> This is your most likely culprit. Any kind of "do this for every 
>> character" is usually insanely expensive when you can do it in bulk.
>> (This is especially true when needing to do system calls, or if the 
>> called function cannot be inlined.)
>>
> 
> yes, I thought about it, but I assumed that the ocaml gzip module 
> inlines  Gzip.input_char (obviously the gzip module needs an internal 
> cache so Gzip.input_char does not _always_ translate to a system call, 
> most of the time it just pops a char from the internal buffer).

You can also easily try this in C with fgetc() contrasted with fgets(). 
The difference is _huge_ even if they both do comparable numbers of 
syscalls -- assuming that the buffering is identical (I haven't checked, 
but I think it is a reasonable assumption). In the C case, the inlining 
is not really guaranteed, but I don't think it is in OCaml either -- 
though I honestly don't know. You'd have to check the assembler output 
to see if the call gets inlined.

Inlining aside, memory prefetching probably also makes a difference in 
favor of reading in bulk and then processing "in bulk".
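
The same experiment can be tried in OCaml with the standard library alone. The sketch below is only a rough timing harness under assumed conditions: it writes a temporary text file, then counts newlines once with input_char and once with block-wise input. The helper names and the 64 KiB block size are arbitrary choices for this example, and the timings will vary by machine.

```ocaml
(* Write [lines] short lines to a temporary file, run [f] on its path,
   and remove the file afterwards. *)
let with_temp_data lines f =
  let tmp = Filename.temp_file "bulk" ".txt" in
  let oc = open_out_bin tmp in
  for i = 1 to lines do
    Printf.fprintf oc "line number %d\n" i
  done;
  close_out oc;
  Fun.protect ~finally:(fun () -> Sys.remove tmp) (fun () -> f tmp)

(* One library call per character. *)
let count_by_char path =
  let ic = open_in_bin path in
  let n = ref 0 in
  (try
     while true do
       if input_char ic = '\n' then incr n
     done
   with End_of_file -> ());
  close_in ic;
  !n

(* One library call per 64 KiB block, then a tight loop over the block. *)
let count_by_block path =
  let ic = open_in_bin path in
  let buf = Bytes.create 65536 in
  let n = ref 0 in
  let rec loop () =
    let got = input ic buf 0 (Bytes.length buf) in
    if got > 0 then begin
      for i = 0 to got - 1 do
        if Bytes.unsafe_get buf i = '\n' then incr n
      done;
      loop ()
    end
  in
  loop ();
  close_in ic;
  !n

(* Report the CPU time used by [f] together with its result. *)
let time label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %d lines in %.3f s\n" label r (Sys.time () -. t0);
  r

let () =
  with_temp_data 200_000 (fun path ->
    let a = time "input_char" (fun () -> count_by_char path) in
    let b = time "block input" (fun () -> count_by_block path) in
    assert (a = b))
```

Both functions read through the same buffered channel type, so any difference comes from the per-call overhead and the shape of the inner loop, not from the number of system calls.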

> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

I would guess so, yes.

(There may of course be other reasons for a large portion of the 
difference as others have pointed out.)

-- 
Bardur Arantsson
<bardurREMOVE@THISscientician.net>

- 'Blackmail' is such an ugly word. I prefer 'extortion'. The X
makes it sound cool.
                                                Bender, 'Futurama'



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-29 19:15   ` Sam Steingold
  2006-08-29 19:48     ` Bárður Árantsson
@ 2006-08-29 19:54     ` Gerd Stolpmann
  2006-08-29 20:04     ` Gerd Stolpmann
  2 siblings, 0 replies; 12+ messages in thread
From: Gerd Stolpmann @ 2006-08-29 19:54 UTC (permalink / raw)
  To: Sam Steingold; +Cc: Bardur Arantsson, caml-list

On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
> Bardur Arantsson wrote:
> > Sam Steingold wrote:
> >> I read through a huge *.gz file.
> >> I have two versions of the code:
> > [--snip--]
> >>
> >> let buf = Buffer.create 1024
> >> let gz_input_line gz_in char_counter line_counter =
> >>   Buffer.clear buf;
> >>   let finish () = incr line_counter; Buffer.contents buf in
> >>   let rec loop () =
> >>     let ch = Gzip.input_char gz_in in
> > 
> > This is your most likely culprit. Any kind of "do this for every 
> > character" is usually insanely expensive when you can do it in bulk.
> > (This is especially true when needing to do system calls, or if the 
> > called function cannot be inlined.)
> > 
> 
> yes, I thought about it, but I assumed that the ocaml gzip module 
> inlines  Gzip.input_char (obviously the gzip module needs an internal 
> cache so Gzip.input_char does not _always_ translate to a system call, 
> most of the time it just pops a char from the internal buffer).

This may be a godi issue, because gzip.cmx is not installed. Inlining
needs the .cmx file. However, I am not sure whether input_char can be
inlined at all. You can find that out with the dumpapprox tool:

dumpapprox path/to/foo.cmx

Look for the "Approximation" section. If the function (or better entry
point) is listed with the "(inline)" flag it can be inlined, otherwise
not.

> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

The question is whether you finally get a loop that can be completely
executed in the CPU's cache, and how many variables need to be read and
written in a loop cycle. Whether functions are inlined or not is usually
not that important. My experience is that the Gzip.input method is
faster.
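
As an illustration of the block-based method, here is a minimal sketch of a line reader built on a bulk read function. It uses only the standard library; make_line_reader and its read parameter are names invented for this example. The assumption is that read behaves like Gzip.input, storing up to len bytes and returning 0 at end of input, so that with a CamlZip version whose Gzip.input fills a bytes buffer, passing `Gzip.input gz_in` as read should work. That combination is not tested here.

```ocaml
(* [make_line_reader read] returns a function that yields one line per
   call and raises End_of_file when the input is exhausted.
   [read buf pos len] must store up to [len] bytes into [buf] at [pos]
   and return how many were stored, 0 meaning end of input. *)
let make_line_reader ?(block_size = 4096) read =
  let block = Bytes.create block_size in
  (* Unread bytes are block[pos, limit). *)
  let pos = ref 0 and limit = ref 0 in
  let line = Buffer.create 256 in
  let take_line () =
    let s = Buffer.contents line in
    Buffer.clear line;
    s
  in
  let rec find_newline i =
    if i >= !limit then None
    else if Bytes.unsafe_get block i = '\n' then Some i
    else find_newline (i + 1)
  in
  let rec next () =
    if !pos >= !limit then begin
      (* Block used up: refill it with a single bulk read. *)
      limit := read block 0 block_size;
      pos := 0
    end;
    if !limit = 0 then begin
      (* End of input: return a final unterminated line, if any. *)
      if Buffer.length line = 0 then raise End_of_file else take_line ()
    end
    else
      match find_newline !pos with
      | Some nl ->
          Buffer.add_subbytes line block !pos (nl - !pos);
          pos := nl + 1;
          take_line ()
      | None ->
          (* The line continues in the next block. *)
          Buffer.add_subbytes line block !pos (!limit - !pos);
          pos := !limit;
          next ()
  in
  next

(* Stand-in for a gzip channel: serve bytes from a string, at most
   [step] bytes per call. *)
let string_reader ?(step = 5) data =
  let off = ref 0 in
  fun dst pos len ->
    let n = min (min len step) (String.length data - !off) in
    Bytes.blit_string data !off dst pos n;
    off := !off + n;
    n

let () =
  let read = string_reader "first\nsecond line\n\nlast without newline" in
  let next_line = make_line_reader ~block_size:8 read in
  try
    while true do
      Printf.printf "[%s]\n" (next_line ())
    done
  with End_of_file -> ()
```

The inner loop only scans a block that is already in memory, and the reader is called once per block instead of once per character. A last line without a trailing newline is still returned before End_of_file is raised.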

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-29 19:15   ` Sam Steingold
  2006-08-29 19:48     ` Bárður Árantsson
  2006-08-29 19:54     ` [Caml-list] " Gerd Stolpmann
@ 2006-08-29 20:04     ` Gerd Stolpmann
  2006-08-30  0:44       ` malc
  2 siblings, 1 reply; 12+ messages in thread
From: Gerd Stolpmann @ 2006-08-29 20:04 UTC (permalink / raw)
  To: Sam Steingold; +Cc: Bardur Arantsson, caml-list

On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

Ah yes, and there is an easy solution with ocamlnet:

class input_gzip_rec gzip_ch : Netchannels.rec_in_channel =
object(self)
  method input s p l =
    let n = Gzip.input gzip_ch s p l in
    if n = 0 then raise End_of_file;
    n
  method close_in() =
    Gzip.close_in gzip_ch
end


Then use it as follows:

let gz_ch = 
  Netchannels.lift_in (`Rec (new input_gzip_rec gz_in))

let line = gz_ch # input_line()

This adds a buffering layer.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-29 20:04     ` Gerd Stolpmann
@ 2006-08-30  0:44       ` malc
  2006-08-30  0:53         ` Jonathan Roewen
  0 siblings, 1 reply; 12+ messages in thread
From: malc @ 2006-08-30  0:44 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

On Tue, 29 Aug 2006, Gerd Stolpmann wrote:

> On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
>> at any rate, do you really expect that using Gzip.input and then
>> searching the result for a newline, slicing and dicing to get the
>> individual input lines, &c &c would be faster?
>
> Ah yes, and there is an easy solution with ocamlnet:

[..snip..]

> This adds a buffering layer.

The Netchannels buffering looks very elegant, but my (admittedly rather
cursory) testing shows that it's also rather slow.

The following code implements four line readers:
Sam's original [char]
Netchannels [net]
open_process_in [zcat]
and buffered (trying to stay compatible with the original interface) [block]

While Netchannels does beat the original implementation, it loses to all
the other methods (on my machine).

let buf = Buffer.create 1024
let gz_input_line gz_in char_counter line_counter =
   Buffer.clear buf;
   let finish () = incr line_counter; Buffer.contents buf in
   let rec loop () =
     let ch = Gzip.input_char gz_in in
     char_counter := Int64.succ !char_counter;
     if ch = '\n' then finish () else ( Buffer.add_char buf ch; loop (); ) in
   try loop ()
   with End_of_file ->
     if Buffer.length buf = 0 then raise End_of_file else finish ()

class input_gzip_rec gzip_ch : Netchannels.rec_in_channel =
object(self)
   method input s p l =
     let n = Gzip.input gzip_ch s p l in
     if n = 0 then raise End_of_file;
     n
   method close_in() =
     Gzip.close_in gzip_ch
end

let wrap_gz gz_in =
   let s = String.create 4096 in
   let b = Buffer.create 1024 in
   let r = ref (fun _ _ -> assert false) in
   let findlf s start finish =
     let rec loop pos = if pos >= finish then None
     else if String.unsafe_get s pos = '\n' then Some pos else loop (succ pos)
     in loop start
   in
   let rec cont pos char_counter line_counter =
     let n = Gzip.input gz_in s pos (String.length s - pos) in
     let rec subcont pos len char_counter line_counter =
       let finish = pos + len in
       match findlf s pos finish with
       | None ->
           Buffer.add_substring b s pos len;
           cont 0 char_counter line_counter

       | Some lfpos ->
           let runlen = lfpos - pos in
           incr line_counter;
           Buffer.add_substring b s pos runlen;
           let s = Buffer.contents b in
           Buffer.clear b;
           r := subcont (succ lfpos) (len - succ runlen);
           s
     in
     if n = 0
     then raise End_of_file
     else (
       char_counter := Int64.add (Int64.of_int n) !char_counter;
       subcont pos n char_counter line_counter
      )
   in
   let exec c l = !r c l in
   r := cont 0;
   exec

let char () =
   let gz = Gzip.open_in_chan stdin in
   let cc = ref 0L in
   let lc = ref 0 in
   try
     while true
     do
       let _line = gz_input_line gz cc lc in
       ()
     done
   with End_of_file ->
     Format.printf "cc=%Ld lc=%d@." !cc !lc

let block () =
   let gz = Gzip.open_in_chan stdin in
   let cc = ref 0L in
   let lc = ref 0 in
   let lg = wrap_gz gz in
   try
     while true
     do
       let _line = lg cc lc in
       ()
     done
   with End_of_file ->
     Format.printf "cc=%Ld lc=%d@." !cc !lc

let zcat () =
   let ic = Unix.open_process_in "zcat" in
   let cc = ref 0L in
   let lc = ref 0 in
   try
     while true
     do
       let _line = input_line ic in
       cc := Int64.add (Int64.of_int (String.length _line + 1)) !cc;
       incr lc
     done
   with End_of_file ->
     Format.printf "cc=%Ld lc=%d@." !cc !lc

let net () =
   let gz_in = Gzip.open_in_chan stdin in
   let gz_ch = Netchannels.lift_in (`Rec (new input_gzip_rec gz_in)) in
   let cc = ref 0L in
   let lc = ref 0 in
   try
     while true
     do
       let _line = gz_ch#input_line () in
       cc := Int64.add (Int64.of_int (String.length _line + 1)) !cc;
       incr lc
     done
   with End_of_file ->
     Format.printf "cc=%Ld lc=%d@." !cc !lc

let _ =
   match Sys.argv with
   | [| _; "char" |] -> char ()
   | [| _; "zcat" |] -> zcat ()
   | [| _; "block" |] -> block ()
   | [| _; "net" |] -> net ()
   | _ -> prerr_endline (Sys.argv.(0) ^ ": [char|zcat|block|net]")

--
mailto:malc@pulsesoft.com



* Re: [Caml-list] Re: zcat vs CamlZip
  2006-08-30  0:44       ` malc
@ 2006-08-30  0:53         ` Jonathan Roewen
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Roewen @ 2006-08-30  0:53 UTC (permalink / raw)
  Cc: caml-list

Have you tried the Unzip module from ExtLib? I haven't tried it myself,
but I plan on using it later on.

Jonathan




* Re: [Caml-list] zcat vs CamlZip
  2006-08-29 18:40 zcat vs CamlZip Sam Steingold
  2006-08-29 18:54 ` Bardur Arantsson
  2006-08-29 19:11 ` [Caml-list] " Eric Cooper
@ 2006-08-30  6:12 ` Jeff Henrikson
  2 siblings, 0 replies; 12+ messages in thread
From: Jeff Henrikson @ 2006-08-30  6:12 UTC (permalink / raw)
  To: Sam Steingold; +Cc: caml-list

I was planning on using the library "ocaml gz" in my application, which 
is a binding to zlib.  I haven't done any detailed benchmarking, but I 
presume its speed is comparable to gzip/gunzip since they just call out 
to zlib.

http://ocamlplot.sourceforge.net/


Jeff Henrikson




