From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <info@gerd-stolpmann.de>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 578F1BC29
	for <caml-list@yquem.inria.fr>; Tue, 29 Aug 2006 21:54:48 +0200 (CEST)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.177])
	by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id k7TJsl2G004705
	for <caml-list@inria.fr>; Tue, 29 Aug 2006 21:54:48 +0200
Received: from [84.58.144.101] (helo=gate.lan.gerd-stolpmann.de)
	by mrelayeu.kundenserver.de (node=mrelayeu0) with ESMTP (Nemesis),
	id 0MKwh2-1GI9fR2ABU-0003r6; Tue, 29 Aug 2006 21:54:37 +0200
Received: from flakew.lan.gerd-stolpmann.de (fw.lan.gerd-stolpmann.de [192.168.1.1])
	by gate.lan.gerd-stolpmann.de (Postfix) with ESMTP id 186B5C08C;
	Tue, 29 Aug 2006 21:54:37 +0200 (CEST)
Subject: Re: [Caml-list] Re: zcat vs CamlZip
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Sam Steingold <sds@podval.org>
Cc: Bardur Arantsson <spam@scientician.net>, caml-list@inria.fr
In-Reply-To: <44F49267.9080904@podval.org>
References: <44F48A17.5080005@podval.org> <ed22gp$un$1@sea.gmane.org>
	 <44F49267.9080904@podval.org>
Content-Type: text/plain
Date: Tue, 29 Aug 2006 21:54:34 +0200
Message-Id: <1156881275.21883.116.camel@localhost.localdomain>
Mime-Version: 1.0
X-Mailer: Evolution 2.4.1 
Content-Transfer-Encoding: 7bit
X-Provags-ID: kundenserver.de abuse@kundenserver.de login:a6865a839c0178d9aa0ce41878507ea2
X-Miltered: at concorde with ID 44F49B87.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; camlzip:01 gerd:01 stolpmann:01 buf:01 buffer:01 buffer:01 buf:01 inlined:01 ocaml:01 cmx:01 inlining:01 cmx:01 inlined:01 foo:01 gerd:01 

On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
> Bardur Arantsson wrote:
> > Sam Steingold wrote:
> >> I read through a huge *.gz file.
> >> I have two versions of the code:
> > [--snip--]
> >>
> >> let buf = Buffer.create 1024
> >> let gz_input_line gz_in char_counter line_counter =
> >>   Buffer.clear buf;
> >>   let finish () = incr line_counter; Buffer.contents buf in
> >>   let rec loop () =
> >>     let ch = Gzip.input_char gz_in in
> > 
> > This is your most likely culprit. Any kind of "do this for every 
> > character" is usually insanely expensive when you can do it in bulk.
> > (This is especially true when needing to do system calls, or if the 
> > called function cannot be inlined.)
> > 
> 
> yes, I thought about it, but I assumed that the ocaml gzip module 
> inlines  Gzip.input_char (obviously the gzip module needs an internal 
> cache so Gzip.input_char does not _always_ translate to a system call, 
> most of the time it just pops a char from the internal buffer).

This may be a GODI issue: gzip.cmx is not installed, and cross-module
inlining needs the .cmx file. However, I am not sure whether input_char
can be inlined at all. You can find out with the dumpapprox tool:

dumpapprox path/to/foo.cmx

Look for the "Approximation" section. If the function (or rather its
entry point) is listed with the "(inline)" flag, it can be inlined;
otherwise it cannot.

> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

The question is whether you end up with a loop that runs entirely
within the CPU's cache, and how many variables must be read and written
per loop iteration. Whether functions are inlined is usually not that
important. In my experience, the Gzip.input approach is faster.
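
A minimal sketch of the bulk approach, to make the comparison concrete.
It is not taken from CamlZip: the names reader, make_reader,
input_line_bulk and string_reader are made up for illustration, and the
chunk size is an arbitrary choice. It assumes a read function with the
shape of Gzip.input (buffer, position, length, returning the number of
bytes read and 0 at end of input) in a CamlZip version that works on
bytes. The demo at the bottom uses an in-memory reader so the code runs
without CamlZip.

```ocaml
(* Bulk line reader: refill a large chunk with [read], then scan the
   chunk for newlines in a tight loop. [read buf pos len] must return
   the number of bytes stored, or 0 at end of input. Assumption:
   [Gzip.input gz_in] has this shape in a bytes-based CamlZip. *)
type reader = {
  read : bytes -> int -> int -> int;
  chunk : bytes;        (* current block of decompressed data *)
  mutable pos : int;    (* next unread byte in [chunk] *)
  mutable len : int;    (* number of valid bytes in [chunk] *)
  line : Buffer.t;      (* collects a line that spans several chunks *)
}

let make_reader ?(chunk_size = 65536) read =
  { read; chunk = Bytes.create chunk_size; pos = 0; len = 0;
    line = Buffer.create 1024 }

(* Returns the next line without its '\n'; raises End_of_file when the
   input is exhausted. A final line without '\n' is still returned. *)
let input_line_bulk r =
  Buffer.clear r.line;
  let rec loop () =
    if r.pos >= r.len then begin
      (* Chunk used up: fetch the next block in one call. *)
      r.len <- r.read r.chunk 0 (Bytes.length r.chunk);
      r.pos <- 0
    end;
    if r.len = 0 then begin
      if Buffer.length r.line = 0 then raise End_of_file
      else Buffer.contents r.line
    end else begin
      (* Scan the unread part of the chunk for a newline. *)
      let stop = ref r.pos in
      while !stop < r.len && Bytes.get r.chunk !stop <> '\n' do
        incr stop
      done;
      Buffer.add_subbytes r.line r.chunk r.pos (!stop - r.pos);
      if !stop < r.len then begin
        r.pos <- !stop + 1;     (* skip the newline itself *)
        Buffer.contents r.line
      end else begin
        r.pos <- r.len;         (* no newline yet: refill and go on *)
        loop ()
      end
    end
  in
  loop ()

(* Hypothetical stand-in for a gzip channel: reads from a string. *)
let string_reader s =
  let off = ref 0 in
  fun buf pos len ->
    let n = min len (String.length s - !off) in
    Bytes.blit_string s !off buf pos n;
    off := !off + n;
    n

let () =
  (* A tiny chunk size forces lines to span several refills. *)
  let r = make_reader ~chunk_size:4 (string_reader "alpha\n\nbeta\ngamma") in
  try
    while true do print_endline (input_line_bulk r) done
  with End_of_file -> ()
```

With CamlZip this would presumably be used as
make_reader (Gzip.input gz_in), so that the per-character work is a
plain loop over a byte buffer and the library is called only once per
chunk.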

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------