From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kirillkh@gmail.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: *
X-Spam-Status: No, score=1.6 required=5.0 tests=HTML_MESSAGE,MISSING_HEADERS,
	SPF_NEUTRAL autolearn=disabled version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by yquem.inria.fr (Postfix) with ESMTP id 5587ABC69
	for <caml-list@yquem.inria.fr>; Tue,  2 Oct 2007 20:02:40 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AgAAAHgkAkfRVZKzmGdsb2JhbACCcYtEAQEBAQcEBCk
X-IronPort-AV: E=Sophos;i="4.21,221,1188770400"; 
   d="scan'208";a="17204498"
Received: from wa-out-1112.google.com ([209.85.146.179])
  by mail4-smtp-sop.national.inria.fr with ESMTP; 02 Oct 2007 20:02:39 +0200
Received: by wa-out-1112.google.com with SMTP id k17so5234750waf
        for <caml-list@yquem.inria.fr>; Tue, 02 Oct 2007 11:02:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=beta;
        h=domainkey-signature:received:received:message-id:date:from:subject:cc:in-reply-to:mime-version:content-type:references;
        bh=X1mUbH9jYrCrrBhXyhZf9m1gwGHsKEdzEd2/OHl2cxc=;
        b=Z9xDcb+l7+2HwqWtnfQRq+D6Zr8uLsiWIVFrXTk8rbQlx24DbyTvpYnID7yh3MXpizPyGa10kBZJp/qo0r33fUpaVeMN9zLvEz9GIDAyctlbOmrwAOHty1eN4jxy27adLNBjjklfyP98pnXBZSFZ91Fo45urc/jWCbgxRGl9WFk=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=beta;
        h=received:message-id:date:from:subject:cc:in-reply-to:mime-version:content-type:references;
        b=c83SAk4wbvAZEXd/vx8NmfEq01P/6IOP2vBndEXSUus0XkIxQD+jQJjcgvqpK2mR2MVbct/ByXR916FxDWakemi/+ZhMWTNs2o9gPRkcLzeq3FNi36cBlPmkNE9J8VaievLVUoPIeAV6FrGiGTpjKpA1/RDTfrilbuWLzWqnAww=
Received: by 10.114.125.2 with SMTP id x2mr2679425wac.1191348157861;
        Tue, 02 Oct 2007 11:02:37 -0700 (PDT)
Received: by 10.114.25.8 with HTTP; Tue, 2 Oct 2007 11:02:37 -0700 (PDT)
Message-ID: <e2d02be30710021102h565c7e2ax88b03a4066812816@mail.gmail.com>
Date: Tue, 2 Oct 2007 20:02:37 +0200
From: kirillkh <kirillkh@gmail.com>
Subject: Re: [Caml-list] best and fastest way to read lines from a file?
Cc: caml-list@yquem.inria.fr
In-Reply-To: <e2d02be30710020915y2f115359j3151c8d36aa2c3e8@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; 
	boundary="----=_Part_19623_10709104.1191348157853"
References: <779bf2730710011427g5983da4cw6ad8b715a9e38771@mail.gmail.com>
	 <47016CEE.8010704@crans.org> <200710021239.l92CdwZ15641@virtutech.se>
	 <47024002.2080206@janestcapital.com>
	 <e2d02be30710020915y2f115359j3151c8d36aa2c3e8@mail.gmail.com>
X-Spam: no; 0.00; iterating:01 combinator:01 invokes:01 combinator:01 byte:01 ocaml:01 trivial:01 trivial:01 moderately:01 ocaml:01 parser:01 pitfalls:01 mattias:01 channel-:01 'a-:01 

------=_Part_19623_10709104.1191348157853
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Replying to a private mail from Brian:

2007/10/2, Brian Hurt <bhurt@janestcapital.com>:
>
> kirillkh wrote:
>
>
>  This should be a FAQ.
> >
>
> Since we're talking of 10+ lines of code and only one case among many
> possible (you might also want to do something fairly similar, but not quite
> the same, as iterating over all words or characters in a file, doing
> something else than counting, etc.), I would rather see it implemented in a
> library as combinator. What I have in mind is a function that goes over a
> file and invokes some user code on each block of
> bytes/characters/lines/words/... The points of customization would be:
> * how to detect the start and end of block
> * routine to pass the blocks to
>
> Then, on top of this combinator, build block-specific ones: for byte,
> char, line, word blocks. Also make it possible to customize buffering
> behavior.
>
> Being new to OCaml, I'm interested in comments - is what I suggest a good
> idea? If yes, why hasn't anyone implemented it yet? One could argue that
> it's trivial and can be implemented in each use case anew, but among 5
> different solutions posted so far, each has its own problems, besides the
> supposedly ideal 5-th -- I'd take this as an indication that, although
> simple, it's not really trivial to write this thing.
>
> Note that the fifth version is only perfect because the author knew that
> he didn't need to keep the line around.  Doing something even moderately
> more complicated would almost certainly require an allocation in the main
> loop- not that this is that big of a problem (Ocaml allocation is lightning
> fast).
>
> The problem with this library is when to stop.  A real simple "fold over
> the lines of a file" interface might be nice, but there'll always be
> pressure to add just one more feature, deal with just one more slightly more
> complex case- at the end of which you get a badly specified and badly
> implemented parser combinator library.
>
> Brian
>

OK, so I'll give up the parsing/buffering part and only leave efficient
exception handling. This should leave the user free to do anything with it,
but prevent performance pitfalls. The following is based on Mattias
Engdegard's code:

(* I couldn't figure out, how to declare a polymorphic exception properly *)
exception Done of 'a

let fold_file (file: in_channel)
              (read_func: in_channel->'a)
              (elem_func: 'a->'b->'b)
              (seed: 'b) =
  let rec loop prev_val =
    let input =
      try read_func file
      with End_of_file -> raise (Done prev_val)
    in
      let combined_val = elem_func input prev_val in
      loop combined_val
  in
    try loop seed with Done x -> x

And the usage for line counting:

let line_count filename =
   let f = open_in filename in
   let counter _ count = count + 1 in
   fold_file f readline counter 0

Since it's library code, we can tolerate the little annoyance of the second
try-catch. As far as I can tell, this code has the same performance
characteristics as yours: no consing + tail recursion. Any other problems
with it?

------=_Part_19623_10709104.1191348157853
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Replying to a private mail from Brian:<br><br><div><span class="gmail_quote">2007/10/2, Brian Hurt &lt;<a href="mailto:bhurt@janestcapital.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">bhurt@janestcapital.com
</a>&gt;:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><span class="q" id="q_11561d7cfaa92b59_1"><div bgcolor="#ffffff" text="#000000">
<div><span>kirillkh wrote:<br>
<blockquote cite="http://mide2d02be30710020915y2f115359j3151c8d36aa2c3e8@mail.gmail.com" type="cite"><br>
  <div><span class="gmail_quote"></span>
  <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
    <div bgcolor="#ffffff" text="#000000">This should be a FAQ.
    <span><br>
    </span></div>
  </blockquote>
  </div>
  <br>
Since we&#39;re talking of 10+ lines of code and only one case among many
possible (you might also want to do something fairly similar, but not
quite the same, as iterating over all words or characters in a file,
doing something else than counting, etc.), I would rather see it
implemented in a library as combinator. What I have in mind is a
function that goes over a file and invokes some user code on each block
of bytes/characters/lines/words/... The points of customization would
be:<br>
* how to detect the start and end of block<br>
* routine to pass the blocks to<br>
  <br>
Then, on top of this combinator, build block-specific ones: for byte,
char, line, word blocks. Also make it possible to customize buffering
behavior.<br>
  <br>
Being new to OCaml, I&#39;m interested in comments - is what I suggest a
good idea? If yes, why hasn&#39;t anyone implemented it yet? One could
argue that it&#39;s trivial and can be implemented in each use case anew,
but among 5 different solutions posted so far, each has its own
problems, besides the supposedly ideal 5-th -- I&#39;d take this as an
indication that, although simple, it&#39;s not really trivial to write this
thing.<br>
</blockquote></span></div>
Note that the fifth version is only perfect because the author knew
that he didn&#39;t need to keep the line around.&nbsp; Doing something even
moderately more complicated would almost certainly require an
allocation in the main loop- not that this is that big of a problem
(Ocaml allocation is lightning fast).<br>
<br>
The problem with this library is when to stop.&nbsp; A real simple &quot;fold
over the lines of a file&quot; interface might be nice, but there&#39;ll always
be pressure to add just one more feature, deal with just one more
slightly more complex case- at the end of which you get a badly
specified and badly implemented parser combinator library.<br>
<br>
Brian<br>
</div></span></div>

</blockquote></div><br>OK, so I&#39;ll give up the parsing/buffering part
and only leave efficient exception handling. This should leave the user
free to do anything with it, but prevent performance pitfalls. The
following is based on Mattias Engdegard&#39;s code:
<br><br><div style="margin-left: 40px;">(* I couldn&#39;t figure out, how to declare a polymorphic exception properly *)<br>exception Done of &#39;a<br></div><div style="margin-left: 40px;"><br></div><div style="margin-left: 40px;">

let fold_file (file: in_channel)<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp; (read_func: in_channel-&gt;&#39;a)<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp; (elem_func: &#39;a-&gt;&#39;b-&gt;&#39;b)<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp; (seed: &#39;b) =<br>&nbsp; let rec loop prev_val =<br>&nbsp;&nbsp;&nbsp; let input =
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try read_func file<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with End_of_file -&gt; raise (Done prev_val)<br>&nbsp;&nbsp;&nbsp; in<br>&nbsp;&nbsp;&nbsp; &nbsp; let combined_val = elem_func input prev_val in<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; loop combined_val<br>&nbsp; in<br>&nbsp;&nbsp;&nbsp; try loop seed with Done x -&gt; x
<br></div><br>And the usage for line counting:<br><br><div style="margin-left: 40px;"><span class="q">let line_count filename =<br>&nbsp;&nbsp; let f = open_in filename in<br></span>&nbsp;&nbsp; let counter _ count = count + 1 in<br>&nbsp;&nbsp; fold_file f readline counter 0
<br></div><br>Since it&#39;s library code, we can tolerate the little
annoyance of the second try-catch. As far as I can tell, this code has
the same performance characteristics as yours: no consing + tail
recursion. Any other problems with it?

------=_Part_19623_10709104.1191348157853--