From: Philippe Veber <philippe.veber@gmail.com>
Date: Mon, 19 Mar 2012 10:08:17 +0100
To: Edgar Friendly <thelema314@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Efficient scanning of large strings from files

Thanks Edgar and Jérémie, this indeed seems to be the right track. I just hope that repeated use of input_char is not 10-100X slower than input_line :o).
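(A quick way to settle that worry would be a micro-benchmark along these lines. This is a minimal sketch, not a measurement: it assumes the file to scan is given on the command line and needs the unix library for Unix.gettimeofday. Since stdlib channels are buffered, input_char should cost a function call per byte, not a system call.)

    (* Minimal sketch: time byte-wise vs line-wise traversal of the
       same file. Requires the unix library. *)
    let time label f =
      let t0 = Unix.gettimeofday () in
      f ();
      Printf.printf "%s: %.3fs\n%!" label (Unix.gettimeofday () -. t0)

    let by_char path =
      let ic = open_in path in
      (try while true do ignore (input_char ic) done
       with End_of_file -> ());
      close_in ic

    let by_line path =
      let ic = open_in path in
      (try while true do ignore (input_line ic) done
       with End_of_file -> ());
      close_in ic

    let () =
      time "input_char" (fun () -> by_char Sys.argv.(1));
      time "input_line" (fun () -> by_line Sys.argv.(1))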
ph.
2012/3/16 Edgar Friendly <thelema314@gmail.com>:

> So given a large file and a line number, you want to:
> 1) extract that line from the file
> 2) produce an enum of all k-length slices of that line?
> 3) match each slice against your regexp set to produce a list/enum of
> substrings that match the regexps?
> Without reading the whole line into memory at once.
>
> I'm with Dimino on the right solution - just use a matcher that works
> incrementally, feed it one byte at a time, and have it return a list of
> match offsets. Then work backwards from these endpoints to figure out
> which substrings you want.
>
> There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
> should suffice to use (0,k-1) and (k,2k-1) with an incremental matching
> routine.
>
> E.
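To make that concrete, the byte-at-a-time loop could look like the sketch below, with a fixed-string matcher standing in for the regexp set. match_ends is a made-up name, not an existing API, and a real incremental regexp engine would keep an automaton state rather than a buffer of the last m bytes:

    (* Sketch: scan a channel byte by byte and return the end offsets
       of every occurrence of a fixed pattern. Memory use is O(m) for
       a pattern of length m, independent of line length. *)
    let match_ends pattern ic =
      let m = String.length pattern in
      let buf = Bytes.make m '\000' in  (* circular buffer: last m bytes *)
      let ends = ref [] in
      let pos = ref 0 in                (* bytes read so far *)
      (try
         while true do
           Bytes.set buf (!pos mod m) (input_char ic);
           incr pos;
           if !pos >= m then begin
             (* oldest byte sits at index !pos mod m; compare in order *)
             let ok = ref true in
             for i = 0 to m - 1 do
               if Bytes.get buf ((!pos + i) mod m) <> pattern.[i]
               then ok := false
             done;
             if !ok then ends := (!pos - 1) :: !ends
           end
         done
       with End_of_file -> ());
      List.rev !ends

Because the matcher itself carries its state (here, the last m bytes) across any boundary, feeding it the non-overlapping chunks (0,k-1), (k,2k-1), ... indeed loses nothing.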
> On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
> <philippe.veber@gmail.com> wrote:
>> Thank you Edgar for your answer (and also Christophe). It seems my
>> question was a bit misleading: actually I target a subset of regexps
>> whose matching is really trivial, so this is no worry for me. I was
>> more interested in how to access a large line in a file by chunks of
>> fixed length k. For instance, how to build a [Substring.t Enum.t] from
>> some line in a file, without building the whole line in memory. This
>> enum would yield the substrings (0,k-1), (1,k), (2,k+1), etc., without
>> doing too many string copy/concat operations. I think I can do it
>> myself, but I'm not too confident regarding good practices on buffered
>> reads of files. Maybe there are some good examples in Batteries?
>>
>> Thanks again,
>> ph.
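For the windowing itself, here is a minimal sketch of what such an enum could look like, with the standard string Seq.t standing in for Batteries' [Substring.t Enum.t] (windows is a hypothetical function, not something from Batteries). It keeps only k bytes in a circular buffer and copies each window out as a fresh string when the sequence is forced:

    (* Sketch: the k-length windows of the first line of a channel as
       a lazy, single-pass sequence: (0,k-1), (1,k), (2,k+1), ...
       Only k bytes stay in memory; reading stops at the first
       newline or at end of file. *)
    let windows k ic : string Seq.t =
      let buf = Bytes.create k in
      (* read the initial window; false if the line is shorter than k *)
      let rec fill i =
        if i = k then true
        else match input_char ic with
          | '\n' -> false
          | c -> Bytes.set buf i c; fill (i + 1)
          | exception End_of_file -> false
      in
      (* the current window as a string; pos = index of its oldest byte *)
      let window pos =
        Bytes.sub_string buf pos (k - pos) ^ Bytes.sub_string buf 0 pos
      in
      let rec next pos () =
        let w = window pos in
        match input_char ic with
        | exception End_of_file -> Seq.Cons (w, Seq.empty)
        | '\n' -> Seq.Cons (w, Seq.empty)
        | c ->
            Bytes.set buf pos c;
            Seq.Cons (w, next ((pos + 1) mod k))
      in
      if fill 0 then next 0 else Seq.empty

The one copy per window (Bytes.sub_string plus the concatenation) seems hard to avoid if the windows may outlive the buffer; Substring-style views over a shared buffer would save the copy, but they would break as soon as the buffer is overwritten by the next read.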