From mboxrd@z Thu Jan 1 00:00:00 1970 X-Sympa-To: caml-list@inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2L7MAA1015323 for ; Wed, 21 Mar 2012 08:22:11 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqgBAIeAaU/RVdQ2kGdsb2JhbABEtnMIIgEBAQEJCQ0HFAQjggkBAQEDAQwGAiwBGx0BAwELBgULAwonByEBAREBBQELEQYTGweHYwWZVQqME4JxhSQ/iHYBBQuJYWCGNQSVXosugxo9hAqBWw X-IronPort-AV: E=Sophos;i="4.73,621,1325458800"; d="scan'208";a="150470957" Received: from mail-vb0-f54.google.com ([209.85.212.54]) by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 21 Mar 2012 08:22:04 +0100 Received: by vbmv11 with SMTP id v11so692818vbm.27 for ; Wed, 21 Mar 2012 00:22:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=4qErac8OUO6vwO26xFFmJ6OXT+5eHgZ01WS1oRrqY+U=; b=wstKHIFUUK9NJFA93XkH48jMurVA0XhO5GMLC9uQ6sm0Ut0SmQZ5/Gu1G3zu1p+wif uSeHRNwLleTBMG2YoaUtXKTGtht3ou0Havnob8S3IMVExhRPXnQ3mnMKXsADJ87Kb9vI NroGQN+KZC+4ZdeGiPoubdFrzyo2dGMeE4svcK0SS4cnz/q9ZhCexug5706H32y09Gy0 6ieigbtXbiwE66rBhSlBTxCObrNTqAZgN7F9H8xDyKJy5bmoZnx6Io3Gzj/fJ4g9KBMI LLJjUdJZ2I5FfrhnqnYUIQ/2b3NEUo15B35fBuT4LmEDMjgAZJEQEQzUY1vSU5nk2HWD qTxg== Received: by 10.220.148.139 with SMTP id p11mr1509533vcv.48.1332314523547; Wed, 21 Mar 2012 00:22:03 -0700 (PDT) MIME-Version: 1.0 Received: by 10.52.31.136 with HTTP; Wed, 21 Mar 2012 00:21:43 -0700 (PDT) In-Reply-To: <4F673830.4010907@gmail.com> References: <4F634AD1.20402@gmail.com> <4F673830.4010907@gmail.com> From: Philippe Veber Date: Wed, 21 Mar 2012 08:21:43 +0100 Message-ID: To: Edgar Friendly Cc: "caml-list@inria.fr" Content-Type: multipart/alternative; boundary=f46d042f9046e7ba1204bbbba578 X-Validation-by: philippe.veber@gmail.com Subject: Re: [Caml-list] Efficient scanning of large strings from files --f46d042f9046e7ba1204bbbba578 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 2012/3/19 Edgar Friendly > On 03/19/2012 05:08 AM, Philippe Veber wrote: > >> Thanks Edgar and J=E9r=E9mie, this indeed seems to be the right track. I >> just hope that a repeated use of input_char is not 10-100X slower than >> input_line :o). >> ph. >> >> Quite true - instead of giving the matcher just a single byte at a time, > it is more efficient to give it blocks of data, as long as it can keep its > state from one block to the next. But its matching internally will be on > one byte at a time, normally. Thanks for the confirmation, I now see more clearly what to do. > I guess with DNA, because of the reduced character set, it'd be possible > to get each symbol down to 2 bits (if you're really just using ACGT), in > which case, the matcher could run 4 basepairs at a time, but there's a lot > of corner issues doing things that way. A lot depends on how much time a= nd > effort you're willing to spend engineering something. > Maybe not that far yet, but this is something we've mentionned for biocaml. I guess I could take some inspiration from the bitset module in Batteries. Anyway thanks everybody for your help! ph. > > E. > > 2012/3/16 Edgar Friendly > > >> >> >> So given a large file and a line number, you want to: >> 1) extract that line from the file >> 2) produce an enum of all k-length slices of that line? >> 3) match each slice against your regexp set to produce a list/enum >> of substrings that match the regexps? >> Without reading the whole line into memory at once. >> >> I'm with Dimino on the right solution - just use a matcher that that >> works incrementally, feed it one byte at a time, and have it return >> a list of match offsets. Then work backwards from these endpoints >> to figure out which substrings you want. >> >> There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it >> should suffice to use (0,k-1) and (k,2k-1) with an incremental >> matching routine. >> >> E. >> >> >> >> On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber >> >> >> wrote: >> >> Thank you Edgar for your answer (and also Christophe). It seems >> my question was a bit misleading: actually I target a subset of >> regexps whose matching is really trivial, so this is no worry >> for me. I was more interested in how accessing a large line in a >> file by chunks of fixed length k. For instance how to build a >> [Substring.t Enum.t] from some line in a file, without building >> the whole line in memory. This enum would yield the substrings >> (0,k-1), (1,k), (2,k+1), etc ... without doing too many string >> copy/concat operations. I think I can do it myself but I'm not >> too confident regarding good practices on buffered reads of >> files. Maybe there are some good examples in Batteries? >> >> Thanks again, >> ph. >> >> >> >> >> > --f46d042f9046e7ba1204bbbba578 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

2012/3/19 Edgar Friendly <thelema314@gmail.com>=
On 03/19/2012 05:08 AM, Philippe Veber wrote:
Thanks Edgar and J=E9r=E9mie, this indeed seems to be the right track. I
just hope that a repeated use of input_char is not 10-100X slower than
input_line :o).
ph.

Quite true - instead of giving the matcher just a single byte at a time, it= is more efficient to give it blocks of data, as long as it can keep its st= ate from one block to the next. =A0But its matching internally will be on o= ne byte at a time, normally. =A0
Thanks for the confirmation, I now see more clearly what to do.
=A0=
I guess with DNA, b= ecause of the reduced character set, it'd be possible to get each symbo= l down to 2 bits (if you're really just using ACGT), in which case, the= matcher could run 4 basepairs at a time, but there's a lot of corner i= ssues doing things that way. =A0A lot depends on how much time and effort y= ou're willing to spend engineering something.
Maybe not that far yet, but this is something we've m= entionned for biocaml. I guess I could take some inspiration from the bitse= t module in Batteries.
Anyway thanks everybody for your help!
ph.
=A0

E.

2012/3/16 Edgar Friendly <thelema314@gmail.com
<mailto:thelem= a314@gmail.com>>


=A0 =A0So given a large file and a line number, you want to:
=A0 =A01) extract that line from the file
=A0 =A02) produce an enum of all k-length slices of that line?
=A0 =A03) match each slice against your regexp set to produce a list/enum<= br> =A0 =A0of substrings that match the regexps?
=A0 =A0Without reading the whole line into memory at once.

=A0 =A0I'm with Dimino on the right solution - just use a matcher that= that
=A0 =A0works incrementally, feed it one byte at a time, and have it return=
=A0 =A0a list of match offsets. =A0Then work backwards from these endpoint= s
=A0 =A0to figure out which substrings you want.

=A0 =A0There shouldn't be a reason to use substrings (0,k-1) and (1,k)= - it
=A0 =A0should suffice to use (0,k-1) and (k,2k-1) with an incremental
=A0 =A0matching routine.

=A0 =A0E.



=A0 =A0On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
=A0 =A0<p= hilippe.veber@gmail.com <mailto:philippe.veber@gmail.com>> wrote= :

=A0 =A0 =A0 =A0Thank you Edgar for your answer (and also Christophe). It s= eems
=A0 =A0 =A0 =A0my question was a bit misleading: actually I target a subse= t of
=A0 =A0 =A0 =A0regexps whose matching is really trivial, so this is no wor= ry
=A0 =A0 =A0 =A0for me. I was more interested in how accessing a large line= in a
=A0 =A0 =A0 =A0file by chunks of fixed length k. For instance how to build= a
=A0 =A0 =A0 =A0[Substring.t Enum.t] from some line in a file, without buil= ding
=A0 =A0 =A0 =A0the whole line in memory. This enum would yield the substri= ngs
=A0 =A0 =A0 =A0(0,k-1), (1,k), (2,k+1), etc ... without doing too many str= ing
=A0 =A0 =A0 =A0copy/concat operations. I think I can do it myself but I= 9;m not
=A0 =A0 =A0 =A0too confident regarding good practices on buffered reads of=
=A0 =A0 =A0 =A0files. Maybe there are some good examples in Batteries?

=A0 =A0 =A0 =A0Thanks again,
=A0 =A0 =A0 =A0 =A0 ph.






--f46d042f9046e7ba1204bbbba578--