From mboxrd@z Thu Jan  1 00:00:00 1970
X-Sympa-To: caml-list@inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2L7MAA1015323
	for <caml-list@sympa-roc.inria.fr>; Wed, 21 Mar 2012 08:22:11 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqgBAIeAaU/RVdQ2kGdsb2JhbABEtnMIIgEBAQEJCQ0HFAQjggkBAQEDAQwGAiwBGx0BAwELBgULAwonByEBAREBBQELEQYTGweHYwWZVQqME4JxhSQ/iHYBBQuJYWCGNQSVXosugxo9hAqBWw
X-IronPort-AV: E=Sophos;i="4.73,621,1325458800"; 
   d="scan'208";a="150470957"
Received: from mail-vb0-f54.google.com ([209.85.212.54])
  by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 21 Mar 2012 08:22:04 +0100
Received: by vbmv11 with SMTP id v11so692818vbm.27
        for <caml-list@inria.fr>; Wed, 21 Mar 2012 00:22:03 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc:content-type;
        bh=4qErac8OUO6vwO26xFFmJ6OXT+5eHgZ01WS1oRrqY+U=;
        b=wstKHIFUUK9NJFA93XkH48jMurVA0XhO5GMLC9uQ6sm0Ut0SmQZ5/Gu1G3zu1p+wif
         uSeHRNwLleTBMG2YoaUtXKTGtht3ou0Havnob8S3IMVExhRPXnQ3mnMKXsADJ87Kb9vI
         NroGQN+KZC+4ZdeGiPoubdFrzyo2dGMeE4svcK0SS4cnz/q9ZhCexug5706H32y09Gy0
         6ieigbtXbiwE66rBhSlBTxCObrNTqAZgN7F9H8xDyKJy5bmoZnx6Io3Gzj/fJ4g9KBMI
         LLJjUdJZ2I5FfrhnqnYUIQ/2b3NEUo15B35fBuT4LmEDMjgAZJEQEQzUY1vSU5nk2HWD
         qTxg==
Received: by 10.220.148.139 with SMTP id p11mr1509533vcv.48.1332314523547;
 Wed, 21 Mar 2012 00:22:03 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.52.31.136 with HTTP; Wed, 21 Mar 2012 00:21:43 -0700 (PDT)
In-Reply-To: <4F673830.4010907@gmail.com>
References: <CAOOOohQV+nYWyTXFU-sKdLkJSNAAkZH+=UCkwKEpnxjAiE=ORg@mail.gmail.com>
 <4F634AD1.20402@gmail.com> <CAOOOohRVtAkiKGC=9GQNUEt=7d=z6kSOU=uDEc2ALZgFepWABg@mail.gmail.com>
 <CAL-jcAm5dfS0CEB=rCry=auP9FeyVM5Cy3PkNhoT4rX2AXX5rg@mail.gmail.com>
 <CAOOOohQ3e3LXk_qMmm__0XeWADjZAGqWmpS7UFGNyt-6bFeY3A@mail.gmail.com> <4F673830.4010907@gmail.com>
From: Philippe Veber <philippe.veber@gmail.com>
Date: Wed, 21 Mar 2012 08:21:43 +0100
Message-ID: <CAOOOohThYfSTGTqvYoKPioxLGnNKaCnR4ow11vwXT4NnXyp-Dw@mail.gmail.com>
To: Edgar Friendly <thelema314@gmail.com>
Cc: "caml-list@inria.fr" <caml-list@inria.fr>
Content-Type: multipart/alternative; boundary=f46d042f9046e7ba1204bbbba578
X-Validation-by: philippe.veber@gmail.com
Subject: Re: [Caml-list] Efficient scanning of large strings from files


--f46d042f9046e7ba1204bbbba578
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

2012/3/19 Edgar Friendly <thelema314@gmail.com>

> On 03/19/2012 05:08 AM, Philippe Veber wrote:
>
>> Thanks Edgar and J=E9r=E9mie, this indeed seems to be the right track. I
>> just hope that a repeated use of input_char is not 10-100X slower than
>> input_line :o).
>> ph.
>>
>>  Quite true - instead of giving the matcher just a single byte at a time,
> it is more efficient to give it blocks of data, as long as it can keep its
> state from one block to the next.  But its matching internally will be on
> one byte at a time, normally.

Thanks for the confirmation, I now see more clearly what to do.


> I guess with DNA, because of the reduced character set, it'd be possible
> to get each symbol down to 2 bits (if you're really just using ACGT), in
> which case, the matcher could run 4 basepairs at a time, but there's a lot
> of corner issues doing things that way.  A lot depends on how much time a=
nd
> effort you're willing to spend engineering something.
>
Maybe not that far yet, but this is something we've mentionned for biocaml.
I guess I could take some inspiration from the bitset module in Batteries.
Anyway thanks everybody for your help!
ph.


>
> E.
>
>  2012/3/16 Edgar Friendly <thelema314@gmail.com
>> <mailto:thelema314@gmail.com>>
>>
>>
>>    So given a large file and a line number, you want to:
>>    1) extract that line from the file
>>    2) produce an enum of all k-length slices of that line?
>>    3) match each slice against your regexp set to produce a list/enum
>>    of substrings that match the regexps?
>>    Without reading the whole line into memory at once.
>>
>>    I'm with Dimino on the right solution - just use a matcher that that
>>    works incrementally, feed it one byte at a time, and have it return
>>    a list of match offsets.  Then work backwards from these endpoints
>>    to figure out which substrings you want.
>>
>>    There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>>    should suffice to use (0,k-1) and (k,2k-1) with an incremental
>>    matching routine.
>>
>>    E.
>>
>>
>>
>>    On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>>    <philippe.veber@gmail.com <mailto:philippe.veber@gmail.**com<philippe=
.veber@gmail.com>>>
>> wrote:
>>
>>        Thank you Edgar for your answer (and also Christophe). It seems
>>        my question was a bit misleading: actually I target a subset of
>>        regexps whose matching is really trivial, so this is no worry
>>        for me. I was more interested in how accessing a large line in a
>>        file by chunks of fixed length k. For instance how to build a
>>        [Substring.t Enum.t] from some line in a file, without building
>>        the whole line in memory. This enum would yield the substrings
>>        (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>>        copy/concat operations. I think I can do it myself but I'm not
>>        too confident regarding good practices on buffered reads of
>>        files. Maybe there are some good examples in Batteries?
>>
>>        Thanks again,
>>           ph.
>>
>>
>>
>>
>>
>

--f46d042f9046e7ba1204bbbba578
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<br><br><div class=3D"gmail_quote">2012/3/19 Edgar Friendly <span dir=3D"lt=
r">&lt;<a href=3D"mailto:thelema314@gmail.com">thelema314@gmail.com</a>&gt;=
</span><br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;bor=
der-left:1px #ccc solid;padding-left:1ex">

<div class=3D"im">On 03/19/2012 05:08 AM, Philippe Veber wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">
Thanks Edgar and J=E9r=E9mie, this indeed seems to be the right track. I<br>
just hope that a repeated use of input_char is not 10-100X slower than<br>
input_line :o).<br>
ph.<br>
<br>
</blockquote></div>
Quite true - instead of giving the matcher just a single byte at a time, it=
 is more efficient to give it blocks of data, as long as it can keep its st=
ate from one block to the next. =A0But its matching internally will be on o=
ne byte at a time, normally. =A0</blockquote>

<div>Thanks for the confirmation, I now see more clearly what to do.<br>=A0=
</div><blockquote class=3D"gmail_quote" style=3D"margin:0pt 0pt 0pt 0.8ex;b=
order-left:1px solid rgb(204,204,204);padding-left:1ex">I guess with DNA, b=
ecause of the reduced character set, it&#39;d be possible to get each symbo=
l down to 2 bits (if you&#39;re really just using ACGT), in which case, the=
 matcher could run 4 basepairs at a time, but there&#39;s a lot of corner i=
ssues doing things that way. =A0A lot depends on how much time and effort y=
ou&#39;re willing to spend engineering something.<br>

</blockquote><div>Maybe not that far yet, but this is something we&#39;ve m=
entionned for biocaml. I guess I could take some inspiration from the bitse=
t module in Batteries.<br>Anyway thanks everybody for your help!<br>ph.<br>

=A0<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0pt 0pt 0pt =
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
E.<br>
<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">
2012/3/16 Edgar Friendly &lt;<a href=3D"mailto:thelema314@gmail.com" target=
=3D"_blank">thelema314@gmail.com</a><br>
&lt;mailto:<a href=3D"mailto:thelema314@gmail.com" target=3D"_blank">thelem=
a314@gmail.com</a>&gt;&gt;<div class=3D"im"><br>
<br>
 =A0 =A0So given a large file and a line number, you want to:<br>
 =A0 =A01) extract that line from the file<br>
 =A0 =A02) produce an enum of all k-length slices of that line?<br>
 =A0 =A03) match each slice against your regexp set to produce a list/enum<=
br>
 =A0 =A0of substrings that match the regexps?<br>
 =A0 =A0Without reading the whole line into memory at once.<br>
<br>
 =A0 =A0I&#39;m with Dimino on the right solution - just use a matcher that=
 that<br>
 =A0 =A0works incrementally, feed it one byte at a time, and have it return=
<br>
 =A0 =A0a list of match offsets. =A0Then work backwards from these endpoint=
s<br>
 =A0 =A0to figure out which substrings you want.<br>
<br>
 =A0 =A0There shouldn&#39;t be a reason to use substrings (0,k-1) and (1,k)=
 - it<br>
 =A0 =A0should suffice to use (0,k-1) and (k,2k-1) with an incremental<br>
 =A0 =A0matching routine.<br>
<br>
 =A0 =A0E.<br>
<br>
<br>
<br>
 =A0 =A0On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber<br></div><div clas=
s=3D"im">
 =A0 =A0&lt;<a href=3D"mailto:philippe.veber@gmail.com" target=3D"_blank">p=
hilippe.veber@gmail.com</a> &lt;mailto:<a href=3D"mailto:philippe.veber@gma=
il.com" target=3D"_blank">philippe.veber@gmail.<u></u>com</a>&gt;&gt; wrote=
:<br>
<br>
 =A0 =A0 =A0 =A0Thank you Edgar for your answer (and also Christophe). It s=
eems<br>
 =A0 =A0 =A0 =A0my question was a bit misleading: actually I target a subse=
t of<br>
 =A0 =A0 =A0 =A0regexps whose matching is really trivial, so this is no wor=
ry<br>
 =A0 =A0 =A0 =A0for me. I was more interested in how accessing a large line=
 in a<br>
 =A0 =A0 =A0 =A0file by chunks of fixed length k. For instance how to build=
 a<br>
 =A0 =A0 =A0 =A0[Substring.t Enum.t] from some line in a file, without buil=
ding<br>
 =A0 =A0 =A0 =A0the whole line in memory. This enum would yield the substri=
ngs<br>
 =A0 =A0 =A0 =A0(0,k-1), (1,k), (2,k+1), etc ... without doing too many str=
ing<br>
 =A0 =A0 =A0 =A0copy/concat operations. I think I can do it myself but I=
9;m not<br>
 =A0 =A0 =A0 =A0too confident regarding good practices on buffered reads of=
<br>
 =A0 =A0 =A0 =A0files. Maybe there are some good examples in Batteries?<br>
<br>
 =A0 =A0 =A0 =A0Thanks again,<br>
 =A0 =A0 =A0 =A0 =A0 ph.<br>
<br>
<br>
<br>
<br>
</div></blockquote>
<br>
</blockquote></div><br>

--f46d042f9046e7ba1204bbbba578--