From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.3 required=5.0 tests=AWL,MAILTO_TO_REMOVE,SPF_FAIL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 841BCBC29 for ; Tue, 29 Aug 2006 21:49:14 +0200 (CEST) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id k7TJnDXG003780 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Tue, 29 Aug 2006 21:49:14 +0200 Received: from list by ciao.gmane.org with local (Exim 4.43) id 1GI9a7-0005zt-5I for caml-list@inria.fr; Tue, 29 Aug 2006 21:49:08 +0200 Received: from 0x5731d08c.bynxx18.adsl-dhcp.tele.dk ([87.49.208.140]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 29 Aug 2006 21:49:07 +0200 Received: from spam by 0x5731d08c.bynxx18.adsl-dhcp.tele.dk with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 29 Aug 2006 21:49:07 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: caml-list@inria.fr From: =?UTF-8?B?QsOhcsOwdXIgw4FyYW50c3Nvbg==?= Subject: Re: zcat vs CamlZip Date: Tue, 29 Aug 2006 21:48:52 +0200 Message-ID: References: <44F48A17.5080005@podval.org> <44F49267.9080904@podval.org> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: 0x5731d08c.bynxx18.adsl-dhcp.tele.dk User-Agent: Thunderbird 1.5.0.5 (X11/20060729) In-Reply-To: <44F49267.9080904@podval.org> Sender: news X-Miltered: at concorde with ID 44F49A39.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; camlzip:01 buf:01 buffer:01 buffer:01 buf:01 inlined:01 ocaml:01 contrasted:01 syscalls:01 inlining:01 ocaml:01 inlined:01 inlining:01 wrote:01 wrote:01 Sam Steingold wrote: > Bardur Arantsson wrote: >> Sam Steingold wrote: >>> I read through a huge *.gz file. >>> I have two versions of the code: >> [--snip--] >>> >>> let buf = Buffer.create 1024 >>> let gz_input_line gz_in char_counter line_counter = >>> Buffer.clear buf; >>> let finish () = incr line_counter; Buffer.contents buf in >>> let rec loop () = >>> let ch = Gzip.input_char gz_in in >> >> This is your most likely culprit. Any kind of "do this for every >> character" is usually insanely expensive when you can do it in bulk. >> (This is especially true when needing to do system calls, or if the >> called function cannot be inlined.) >> > > yes, I thought about it, but I assumed that the ocaml gzip module > inlines Gzip.input_char (obviously the gzip module needs an internal > cache so Gzip.input_char does not _always_ translate to a system call, > most of the time it just pops a char from the internal buffer). You can also easily try this in C with fgetc() contrasted with fgets(). The difference is _huge_ even if they both do comparable numbers of syscalls -- assuming that the buffering is identical (I haven't checked, but I think it is a reasonable assumption). In the C case, the inlining is not really guaranteed, but I don't think it is in OCaml either -- though I honestly don't know. You'd have to check the assembler output to see if the call gets inlined. Inlining aside, memory prefecthing probably also makes a difference in favor of reading in bulk and then processing "in bulk". > at any rate, do you really expect that using Gzip.input and then > searching the result for a newline, slicing and dicing to get the > individual input lines, &c &c would be faster? I would guess so, yes. (There may of course be other reasons for a large portion of the difference as others have pointed out.) -- Bardur Arantsson - 'Blackmail' is such an ugly word. I prefer 'extortion'. The X makes it sound cool. Bender, 'Futurama'