From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id 0A252BC29 for ; Tue, 29 Aug 2006 20:40:33 +0200 (CEST) Received: from janestcapital.com ([66.155.124.107]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k7TIePhS031181 for ; Tue, 29 Aug 2006 20:40:32 +0200 Received: from [192.168.250.217] [209.213.205.130] by janestcapital.com with ESMTP (SMTPD32-8.13) id AA18AB900A4; Tue, 29 Aug 2006 14:40:24 -0400 Message-ID: <44F48A17.5080005@podval.org> Disposition-Notification-To: Sam Steingold Date: Tue, 29 Aug 2006 14:40:23 -0400 From: Sam Steingold User-Agent: Thunderbird 1.5.0.5 (X11/20060808) MIME-Version: 1.0 To: caml-list@inria.fr Subject: zcat vs CamlZip Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Miltered: at nez-perce with ID 44F48A19.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; camlzip:01 unix:01 foo:01 mli:01 mli:01 compactions:01 compactions:01 lacks:01 buf:01 buffer:01 buffer:01 buf:01 camlzip:01 1.2:98 rec:01 I read through a huge *.gz file. I have two versions of the code: 1. use Unix.open_process_in "zcat foo.gz". 2. use gzip.mli (1.2 2002/02/18) as comes with godi 3.09. it turns out that the zcat version is 3(!) times as fast as the gzip.mli one: Run time: 189.435840 sec Self: 189.435840 sec sys: 183.447465 sec user: 5.988375 sec Children: 0.000000 sec sys: 0.000000 sec user: 0.000000 sec GC: minor: 169778 major: 478 compactions: 3 Allocated: 5510457762.0 words Wall clock: 206 sec (00:03:26) vs Run time: 58.471655 sec Self: 54.855429 sec sys: 48.527033 sec user: 6.328396 sec Children: 3.616226 sec sys: 3.168198 sec user: 0.448028 sec GC: minor: 43174 major: 229 compactions: 5 Allocated: 1401290543.0 words Wall clock: 78 sec (00:01:18) since gzip.mli lacks input_line function, I had to roll my own: let buf = Buffer.create 1024 let gz_input_line gz_in char_counter line_counter = Buffer.clear buf; let finish () = incr line_counter; Buffer.contents buf in let rec loop () = let ch = Gzip.input_char gz_in in char_counter := Int64.succ !char_counter; if ch = '\n' then finish () else ( Buffer.add_char buf ch; loop (); ) in try loop () with End_of_file -> if Buffer.length buf = 0 then raise End_of_file else finish () is there something wrong with my gz_input_line? is this a know performance issue with the CamlZip library? thanks. Sam.