caml-list - the Caml user's mailing list
From: "Bárður Árantsson" <spam@scientician.net>
To: caml-list@inria.fr
Subject: Re: zcat vs CamlZip
Date: Tue, 29 Aug 2006 21:48:52 +0200
Message-ID: <ed25n4$enm$1@sea.gmane.org>
In-Reply-To: <44F49267.9080904@podval.org>

Sam Steingold wrote:
> Bardur Arantsson wrote:
>> Sam Steingold wrote:
>>> I read through a huge *.gz file.
>>> I have two versions of the code:
>> [--snip--]
>>>
>>> let buf = Buffer.create 1024
>>> let gz_input_line gz_in char_counter line_counter =
>>>   Buffer.clear buf;
>>>   let finish () = incr line_counter; Buffer.contents buf in
>>>   let rec loop () =
>>>     let ch = Gzip.input_char gz_in in
>>
>> This is your most likely culprit. Any kind of "do this for every 
>> character" is usually insanely expensive when you can do it in bulk.
>> (This is especially true when needing to do system calls, or if the 
>> called function cannot be inlined.)
>>
> 
> yes, I thought about it, but I assumed that the ocaml gzip module 
> inlines  Gzip.input_char (obviously the gzip module needs an internal 
> cache so Gzip.input_char does not _always_ translate to a system call, 
> most of the time it just pops a char from the internal buffer).

You can also easily try this in C by comparing fgetc() with fgets().
The difference is _huge_ even when both make a comparable number of
syscalls, assuming the buffering is identical (I haven't checked, but
I think that is a reasonable assumption). In C the inlining is not
guaranteed, and I don't think it is in OCaml either, though I honestly
don't know. You'd have to check the assembler output to see whether
the call gets inlined.

Inlining aside, memory prefetching probably also makes a difference in
favor of reading in bulk and then processing in bulk.
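
To make the per-character versus bulk comparison concrete without
leaving OCaml, here is a minimal sketch that uses only the standard
library's buffered channels (not CamlZip). The names
count_lines_per_char, count_lines_bulk and time_on are made up for
illustration, and the 64 KiB chunk size is an arbitrary assumption.
Both functions count newlines; only the number of calls per byte
differs.

```ocaml
(* Count newlines one character at a time: one call to [input_char]
   per byte of input. *)
let count_lines_per_char ic =
  let n = ref 0 in
  (try
     while true do
       if input_char ic = '\n' then incr n
     done
   with End_of_file -> ());
  !n

(* Count newlines in bulk: one call to [input] per chunk, then a tight
   loop over bytes that are already in memory. *)
let count_lines_bulk ic =
  let chunk = Bytes.create 65536 in   (* chunk size is an assumption *)
  let n = ref 0 in
  let rec loop () =
    let len = input ic chunk 0 (Bytes.length chunk) in
    if len > 0 then begin
      for i = 0 to len - 1 do
        if Bytes.get chunk i = '\n' then incr n
      done;
      loop ()
    end
  in
  loop ();
  !n

(* Run [f] on a fresh channel for [path]; return its result together
   with the CPU time it took, in seconds. *)
let time_on path f =
  let ic = open_in_bin path in
  let t0 = Sys.time () in
  let r = f ic in
  let dt = Sys.time () -. t0 in
  close_in ic;
  (r, dt)
```

Calling time_on on the same large file with each function gives a
rough idea of the gap on a given machine. How large it is will depend
on the compiler version and flags, so treat any single measurement as
indicative only.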

> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

I would guess so, yes.

(There may of course be other reasons for a large portion of the 
difference as others have pointed out.)
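
For reference, the "read in bulk, then search for newlines" approach
could be sketched roughly as follows. This is not code from the thread:
iter_lines and its chunk_size parameter are hypothetical names, and the
reader is passed in as a function so the sketch runs with the standard
library alone. It assumes the reader behaves like Stdlib's input, that
is, it returns the number of bytes read and 0 at end of input. I
believe Gzip.input in current CamlZip releases has the same shape, but
check the version you have installed.

```ocaml
(* [read buf pos len] must store at most [len] bytes into [buf] at
   [pos] and return how many it stored, 0 meaning end of input.
   [input ic] from the standard library fits; [Gzip.input gz_in] is
   assumed to fit as well. [f] is called once per line, without the
   trailing newline. *)
let iter_lines ?(chunk_size = 65536) read f =
  let chunk = Bytes.create chunk_size in
  (* Holds a line that spans more than one chunk. *)
  let partial = Buffer.create 1024 in
  let rec scan start len =
    (* Find the next newline within the valid part of the chunk. *)
    match Bytes.index_from_opt chunk start '\n' with
    | Some i when i < len ->
        Buffer.add_subbytes partial chunk start (i - start);
        f (Buffer.contents partial);
        Buffer.clear partial;
        scan (i + 1) len
    | _ ->
        (* No newline left: keep the tail for the next chunk. *)
        Buffer.add_subbytes partial chunk start (len - start)
  in
  let rec loop () =
    let len = read chunk 0 chunk_size in
    if len > 0 then begin
      scan 0 len;
      loop ()
    end else if Buffer.length partial > 0 then
      (* Last line had no terminating newline. *)
      f (Buffer.contents partial)
  in
  loop ()
```

With a plain channel this would be used as
iter_lines (input ic) print_endline; with CamlZip the first argument
would presumably be (Gzip.input gz_in). The newline search runs over
memory that has already been filled, so the per-line cost is one
Buffer.contents call rather than one function call per character.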

-- 
Bardur Arantsson
<bardurREMOVE@THISscientician.net>

- 'Blackmail' is such an ugly word. I prefer 'extortion'. The X
makes it sound cool.
                                                Bender, 'Futurama'

