From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 578F1BC29 for ; Tue, 29 Aug 2006 21:54:48 +0200 (CEST) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.177]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id k7TJsl2G004705 for ; Tue, 29 Aug 2006 21:54:48 +0200 Received: from [84.58.144.101] (helo=gate.lan.gerd-stolpmann.de) by mrelayeu.kundenserver.de (node=mrelayeu0) with ESMTP (Nemesis), id 0MKwh2-1GI9fR2ABU-0003r6; Tue, 29 Aug 2006 21:54:37 +0200 Received: from flakew.lan.gerd-stolpmann.de (fw.lan.gerd-stolpmann.de [192.168.1.1]) by gate.lan.gerd-stolpmann.de (Postfix) with ESMTP id 186B5C08C; Tue, 29 Aug 2006 21:54:37 +0200 (CEST) Subject: Re: [Caml-list] Re: zcat vs CamlZip From: Gerd Stolpmann To: Sam Steingold Cc: Bardur Arantsson , caml-list@inria.fr In-Reply-To: <44F49267.9080904@podval.org> References: <44F48A17.5080005@podval.org> <44F49267.9080904@podval.org> Content-Type: text/plain Date: Tue, 29 Aug 2006 21:54:34 +0200 Message-Id: <1156881275.21883.116.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.4.1 Content-Transfer-Encoding: 7bit X-Provags-ID: kundenserver.de abuse@kundenserver.de login:a6865a839c0178d9aa0ce41878507ea2 X-Miltered: at concorde with ID 44F49B87.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; camlzip:01 gerd:01 stolpmann:01 buf:01 buffer:01 buffer:01 buf:01 inlined:01 ocaml:01 cmx:01 inlining:01 cmx:01 inlined:01 foo:01 gerd:01 Am Dienstag, den 29.08.2006, 15:15 -0400 schrieb Sam Steingold: > Bardur Arantsson wrote: > > Sam Steingold wrote: > >> I read through a huge *.gz file. > >> I have two versions of the code: > > [--snip--] > >> > >> let buf = Buffer.create 1024 > >> let gz_input_line gz_in char_counter line_counter = > >> Buffer.clear buf; > >> let finish () = incr line_counter; Buffer.contents buf in > >> let rec loop () = > >> let ch = Gzip.input_char gz_in in > > > > This is your most likely culprit. Any kind of "do this for every > > character" is usually insanely expensive when you can do it in bulk. > > (This is especially true when needing to do system calls, or if the > > called function cannot be inlined.) > > > > yes, I thought about it, but I assumed that the ocaml gzip module > inlines Gzip.input_char (obviously the gzip module needs an internal > cache so Gzip.input_char does not _always_ translate to a system call, > most of the time it just pops a char from the internal buffer). This may be a godi issue, because gzip.cmx is not installed. Inlining needs the .cmx file. However, I am not sure whether input_char can be inlined at all. You can find that out with the dumpapprox tool: dumpapprox path/to/foo.cmx Look for the "Approximation" section. If the function (or better entry point) is listed with the "(inline)" flag it can be inlined, otherwise not. > at any rate, do you really expect that using Gzip.input and then > searching the result for a newline, slicing and dicing to get the > individual input lines, &c &c would be faster? The question is whether you finally get a loop that can be completely executed in the CPU's cache, and how many variables need to be read and written in a loop cycle. Whether functions are inlined or not is usually not that important. My experience is that the Gzip.input method is faster. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------