From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sds@podval.org>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id 0B5CDBC29
	for <caml-list@yquem.inria.fr>; Tue, 29 Aug 2006 21:15:54 +0200 (CEST)
Received: from janestcapital.com ([66.155.124.107])
	by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k7TJFqA6004515
	for <caml-list@inria.fr>; Tue, 29 Aug 2006 21:15:53 +0200
Received: from [192.168.250.217] [209.213.205.130] by janestcapital.com with ESMTP
  (SMTPD32-8.13) id A26849100B0; Tue, 29 Aug 2006 15:15:52 -0400
Message-ID: <44F49267.9080904@podval.org>
Disposition-Notification-To: Sam Steingold <sds@podval.org>
Date: Tue, 29 Aug 2006 15:15:51 -0400
From: Sam Steingold <sds@podval.org>
User-Agent: Thunderbird 1.5.0.5 (X11/20060808)
MIME-Version: 1.0
To: Bardur Arantsson <spam@scientician.net>, caml-list@inria.fr
Subject: Re: zcat vs CamlZip
References: <44F48A17.5080005@podval.org> <ed22gp$un$1@sea.gmane.org>
In-Reply-To: <ed22gp$un$1@sea.gmane.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Miltered: at nez-perce with ID 44F49269.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; camlzip:01 buf:01 buffer:01 buffer:01 buf:01 inlined:01 ocaml:01 wrote:01 wrote:01 rec:01 char:01 char:01 newline:02 module:03 module:03 

Bardur Arantsson wrote:
> Sam Steingold wrote:
>> I read through a huge *.gz file.
>> I have two versions of the code:
> [--snip--]
>>
>> let buf = Buffer.create 1024
>> let gz_input_line gz_in char_counter line_counter =
>>   Buffer.clear buf;
>>   let finish () = incr line_counter; Buffer.contents buf in
>>   let rec loop () =
>>     let ch = Gzip.input_char gz_in in
> 
> This is your most likely culprit. Any kind of "do this for every 
> character" is usually insanely expensive when you can do it in bulk.
> (This is especially true when needing to do system calls, or if the 
> called function cannot be inlined.)
> 

yes, I thought about it, but I assumed that the ocaml gzip module 
inlines  Gzip.input_char (obviously the gzip module needs an internal 
cache so Gzip.input_char does not _always_ translate to a system call, 
most of the time it just pops a char from the internal buffer).
at any rate, do you really expect that using Gzip.input and then 
searching the result for a newline, slicing and dicing to get the 
individual input lines, &c &c would be faster?

Sam.