caml-list - the Caml user's mailing list
* [Caml-list] Efficient scanning of large strings from files
@ 2012-03-16 13:03 Philippe Veber
  2012-03-16 14:14 ` Edgar Friendly
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Philippe Veber @ 2012-03-16 13:03 UTC (permalink / raw)
  To: caml users


Dear camlers,

Say that you'd like to search a regexp on a file with lines so long that
you'd rather not load them entirely at once. If you can bound the size of a
match by k << length of a line, then you know that you can only keep a
small portion of the line in memory to search the regexp. Typically you'd
like to access substrings of size k from left to right. I guess such a
thing should involve buffered inputs and avoid copying strings as much as
possible. My question is as follows: has anybody written a library to
access these substrings gracefully and with decent performance?
Cheers,
  Philippe.
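
A rough, untested sketch of the bounded-window scan described above, for
illustration only: it uses the Str library that ships with OCaml (any matcher
with the same length bound would do), assumes chunk >= k, and all names are
made up here. [report] receives the absolute offset and text of each match;
matches that start too close to the window's edge are deferred to the next
window, which overlaps the current one by k - 1 bytes, so nothing is lost and
memory stays bounded by roughly chunk + k bytes.

  let scan_long_line ic re ~k ~chunk report =
    (* Scan the rest of the current line of [ic] for [re], keeping at most
       chunk + k bytes of it in memory at any time. *)
    let rec refill window base =
      let b = Buffer.create chunk in
      (try
         while Buffer.length b < chunk do
           match input_char ic with
           | '\n' -> raise Exit
           | c -> Buffer.add_char b c
         done
       with Exit | End_of_file -> ());
      scan (window ^ Buffer.contents b) base (Buffer.length b = chunk)
    and scan window base more =
      let len = String.length window in
      let rec loop pos =
        match Str.search_forward re window pos with
        | exception Not_found -> ()
        | start when more && start > len - k ->
          ()  (* too close to the edge: rescanned after the next refill *)
        | start ->
          report (base + start) (Str.matched_string window);
          loop (max (Str.match_end ()) (start + 1))
      in
      loop 0;
      if more then begin
        (* keep the last k - 1 bytes as overlap for the next window *)
        let keep = min (k - 1) len in
        refill (String.sub window (len - keep) keep) (base + len - keep)
      end
    in
    refill "" 0

For example, report = (fun off s -> Printf.printf "%d: %s\n" off s) together
with re = Str.regexp "AC+GT" prints every match of that pattern with its
absolute offset in the line.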



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber
@ 2012-03-16 14:14 ` Edgar Friendly
  2012-03-16 14:48   ` Philippe Veber
  2012-03-16 17:23   ` Francois Charles Matthieu Berenger
  2012-03-16 14:49 ` Jérémie Dimino
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Edgar Friendly @ 2012-03-16 14:14 UTC (permalink / raw)
  To: Philippe Veber, caml-list

On 03/16/2012 09:03 AM, Philippe Veber wrote:
> Dear camlers,
>
> Say that you'd like to search a regexp on a file with lines so long that
> you'd rather not load them entirely at once. If you can bound the size
> of a match by k << length of a line, then you know that you can only
> keep a small portion of the line in memory to search the regexp.
> Typically you'd like to access substrings of size k from left to right.
> I guess such a thing should involve buffered inputs and avoid copying
> strings as much as possible. My question is as follows: has anybody
> written a library to access these substrings gracefully and with decent
> performance?
> Cheers,
>    Philippe.
>
This is tricky to do, as incremental matching implies DFA-based 
matching, but returning matching substrings implies backtrack-based 
matching.  The re2 library returns matching substrings by matching 
forward to find the end of patterns, and then matching backwards on the 
reversed regex from that point to find their beginning.  I don't know if 
even it supports this on incremental input.

Subject to the assumption that starting to match implies either 
finishing or aborting a match (i.e. once you've started to match your 
regex, no other match will start before either this match attempt 
finishes successful or not), this is not unreasonable to do.  Without 
this assumption, it requires tracking many match start locations and 
somehow knowing which of them match or fail to match.  I'm not sure this 
has been done before.

E.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 14:14 ` Edgar Friendly
@ 2012-03-16 14:48   ` Philippe Veber
  2012-03-16 17:02     ` Edgar Friendly
  2012-03-16 17:23   ` Francois Charles Matthieu Berenger
  1 sibling, 1 reply; 15+ messages in thread
From: Philippe Veber @ 2012-03-16 14:48 UTC (permalink / raw)
  To: Edgar Friendly; +Cc: caml-list


2012/3/16 Edgar Friendly <thelema314@gmail.com>

> On 03/16/2012 09:03 AM, Philippe Veber wrote:
>
>> Dear camlers,
>>
>> Say that you'd like to search a regexp on a file with lines so long that
>> you'd rather not load them entirely at once. If you can bound the size
>> of a match by k << length of a line, then you know that you can only
>> keep a small portion of the line in memory to search the regexp.
>> Typically you'd like to access substrings of size k from left to right.
>> I guess such a thing should involve buffered inputs and avoid copying
>> strings as much as possible. My question is as follows: has anybody
>> written a library to access these substrings gracefully and with decent
>> performance?
>> Cheers,
>>   Philippe.
>>
> This is tricky to do, as incremental matching implies DFA-based
> matching, but returning matching substrings implies backtrack-based
> matching.  The re2 library returns matching substrings by matching forward
> to find the end of patterns, and then matching backwards on the reversed
> regex from that point to find their beginning.  I don't know if even it
> supports this on incremental input.
>
> Subject to the assumption that starting to match implies either finishing
> or aborting a match (i.e. once you've started to match your regex, no other
> match will start before either this match attempt finishes successful or
> not), this is not unreasonable to do.  Without this assumption, it requires
> tracking many match start locations and somehow knowing which of them match
> or fail to match.  I'm not sure this has been done before.
>
> E.
>
Thank you Edgar for your answer (and also Christophe). It seems my question
was a bit misleading: actually I target a subset of regexps whose matching
is really trivial, so this is no worry for me. I was more interested in how
to access a large line in a file by chunks of fixed length k. For instance
how to build a [Substring.t Enum.t] from some line in a file, without
building the whole line in memory. This enum would yield the substrings
(0,k-1), (1,k), (2,k+1), etc ... without doing too many string copy/concat
operations. I think I can do it myself but I'm not too confident regarding
good practices on buffered reads of files. Maybe there are some good
examples in Batteries?

Thanks again,
  ph.
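
A minimal, untested sketch of such an enumerator, assuming Batteries' Enum.from
(the supplied function raises Enum.No_more_elements when exhausted). For
simplicity each k-wide window is returned as a fresh string, so there is one
small k-byte copy per window rather than a Substring.t into a shared buffer.

  open Batteries

  (* Enumerate the k-wide windows (0,k-1), (1,k), (2,k+1), ... of the next
     line of [ic], reading one buffered char at a time and keeping only k
     bytes of the line in memory. *)
  let line_windows ic k =
    let buf = Bytes.create k in
    let filled = ref 0 in            (* valid bytes currently in [buf] *)
    let eol = ref false in
    let next () =
      if !eol then raise Enum.No_more_elements;
      if !filled = k then begin      (* slide the window left by one *)
        Bytes.blit buf 1 buf 0 (k - 1);
        filled := k - 1
      end;
      (try
         while !filled < k do
           match input_char ic with
           | '\n' -> raise Exit
           | c -> Bytes.set buf !filled c; incr filled
         done
       with Exit | End_of_file -> eol := true);
      if !filled = k then Bytes.sub_string buf 0 k
      else raise Enum.No_more_elements
    in
    Enum.from next

If even that copy matters, [next] could hand out a Substring.t pointing into a
shared buffer instead, at the price that each window is only valid until the
next element is pulled.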



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber
  2012-03-16 14:14 ` Edgar Friendly
@ 2012-03-16 14:49 ` Jérémie Dimino
  2012-03-18 21:11   ` Török Edwin
  2012-03-16 20:11 ` oliver
  2012-03-18 23:56 ` oliver
  3 siblings, 1 reply; 15+ messages in thread
From: Jérémie Dimino @ 2012-03-16 14:49 UTC (permalink / raw)
  To: Philippe Veber; +Cc: caml users

On Fri, 16 Mar 2012 14:03:38 +0100,
Philippe Veber <philippe.veber@gmail.com> wrote:

> Say that you'd like to search a regexp on a file with lines so long
> that you'd rather not load them entirely at once. If you can bound
> the size of a match by k << length of a line, then you know that you
> can only keep a small portion of the line in memory to search the
> regexp. Typically you'd like to access substrings of size k from left
> to right. I guess such a thing should involve buffered inputs and
> avoid copying strings as much as possible. My question is as follows:
> has anybody written a library to access these substrings gracefully
> and with decent performance? Cheers,

You can use a non-backtracking regexp library to find offsets of the
substrings, then seek in the file to extract them. You can use for
example the libre library from Jérôme Vouillon [1]. It only accepts
strings as input but it would be really easy to make it work on input
channels (just replace "s.[pos]" by "input_char ic").

  [1] http://sourceforge.net/projects/libre/
      https://github.com/avsm/ocaml-re.git

Cheers,

-- 
Jérémie
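
A sketch of the second half of that suggestion, written against the current
ocaml-re API (Re.Pcre.re, Re.compile, Re.all, Re.Group.offset; the 2012 libre
API differs): find the offsets of the matches inside an in-memory window, then
seek back into the file to pull out just the matched bytes, so the window
itself can be thrown away.

  (* e.g. let re = Re.compile (Re.Pcre.re "AC+GT") *)
  let offsets_in_window re window =
    Re.all re window |> List.map (fun g -> Re.Group.offset g 0)

  (* [window_start] is the absolute file position of the window's first byte;
     (start, stop) comes from [offsets_in_window]. *)
  let extract_from_file ic ~window_start (start, stop) =
    seek_in ic (window_start + start);
    really_input_string ic (stop - start)

Scanning the file window by window, with an overlap of at least the maximal
match length between consecutive windows, then never needs more than one
window in memory.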



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 14:48   ` Philippe Veber
@ 2012-03-16 17:02     ` Edgar Friendly
  2012-03-19  9:08       ` Philippe Veber
  0 siblings, 1 reply; 15+ messages in thread
From: Edgar Friendly @ 2012-03-16 17:02 UTC (permalink / raw)
  To: Philippe Veber; +Cc: caml-list


So given a large file and a line number, you want to:
1) extract that line from the file
2) produce an enum of all k-length slices of that line?
3) match each slice against your regexp set to produce a list/enum of
substrings that match the regexps?
Without reading the whole line into memory at once.

I'm with Dimino on the right solution - just use a matcher that works
incrementally, feed it one byte at a time, and have it return a list of
match offsets.  Then work backwards from these endpoints to figure out
which substrings you want.

There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it should
suffice to use (0,k-1) and (k,2k-1) with an incremental matching routine.

E.
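
A sketch of that interface (the record type and the names below are mine, not
from any library): the matcher is a step function over an opaque state, fed
one byte at a time straight off the channel, and the driver collects the
absolute offsets at which matches end. Working backwards by at most k bytes
from each reported end offset then recovers the substring itself.

  type 'st matcher = {
    init : 'st;
    step : 'st -> char -> 'st * bool;   (* true = a match ends on this byte *)
  }

  (* Feed the rest of the current line byte by byte and collect the offsets
     of the last byte of every match. *)
  let match_ends m ic =
    let rec loop st pos acc =
      match input_char ic with
      | exception End_of_file -> List.rev acc
      | '\n' -> List.rev acc
      | c ->
        let st, hit = m.step st c in
        loop st (pos + 1) (if hit then pos :: acc else acc)
    in
    loop m.init 0 []

  (* Hand-written instance: occurrences of the literal "ACGT". *)
  let acgt = {
    init = 0;
    step = (fun st c ->
      let st =
        match st, c with
        | _, 'A' -> 1
        | 1, 'C' -> 2
        | 2, 'G' -> 3
        | 3, 'T' -> 4
        | _ -> 0
      in
      (st, st = 4));
  }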


On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
<philippe.veber@gmail.com> wrote:

> Thank you Edgar for your answer (and also Christophe). It seems my
> question was a bit misleading: actually I target a subset of regexps whose
> matching is really trivial, so this is no worry for me. I was more
> interested in how to access a large line in a file by chunks of fixed
> length k. For instance how to build a [Substring.t Enum.t] from some line
> in a file, without building the whole line in memory. This enum would yield
> the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many
> string copy/concat operations. I think I can do it myself but I'm not too
> confident regarding good practices on buffered reads of files. Maybe there
> are some good examples in Batteries?
>
> Thanks again,
>   ph.
>
>
>



* Re: Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 14:14 ` Edgar Friendly
  2012-03-16 14:48   ` Philippe Veber
@ 2012-03-16 17:23   ` Francois Charles Matthieu Berenger
  2012-03-17 16:53     ` oliver
  2012-03-19  9:08     ` Philippe Veber
  1 sibling, 2 replies; 15+ messages in thread
From: Francois Charles Matthieu Berenger @ 2012-03-16 17:23 UTC (permalink / raw)
  To: Edgar Friendly; +Cc: Philippe Veber, caml-list

hi philippe,

i am curious, is your string a dna sequence, and is that why it is so long?

regards,f


On Fri, 16 Mar 2012 10:14:41 -0400
Edgar Friendly <thelema314@gmail.com> wrote:

> On 03/16/2012 09:03 AM, Philippe Veber wrote:
> > Dear camlers,
> >
> > Say that you'd like to search a regexp on a file with lines so long that
> > you'd rather not load them entirely at once. If you can bound the size
> > of a match by k << length of a line, then you know that you can only
> > keep a small portion of the line in memory to search the regexp.
> > Typically you'd like to access substrings of size k from left to right.
> > I guess such a thing should involve buffered inputs and avoid copying
> > strings as much as possible. My question is as follows: has anybody
> > written a library to access these substrings gracefully and with decent
> > performance?
> > Cheers,
> >    Philippe.
> >
> This is tricky to do, as incremental matching implies DFA-based 
> matching, but returning matching substrings implies backtrack-based 
> matching.  The re2 library returns matching substrings by matching 
> forward to find the end of patterns, and then matching backwards on the 
> reversed regex from that point to find their beginning.  I don't know if 
> even it supports this on incremental input.
> 
> Subject to the assumption that starting to match implies either 
> finishing or aborting a match (i.e. once you've started to match your 
> regex, no other match will start before either this match attempt 
> finishes successful or not), this is not unreasonable to do.  Without 
> this assumption, it requires tracking many match start locations and 
> somehow knowing which of them match or fail to match.  I'm not sure this 
> has been done before.
> 
> E.
> 





* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber
  2012-03-16 14:14 ` Edgar Friendly
  2012-03-16 14:49 ` Jérémie Dimino
@ 2012-03-16 20:11 ` oliver
  2012-03-18 23:56 ` oliver
  3 siblings, 0 replies; 15+ messages in thread
From: oliver @ 2012-03-16 20:11 UTC (permalink / raw)
  To: Philippe Veber; +Cc: caml users

On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote:
> Dear camlers,
> 
> Say that you'd like to search a regexp on a file with lines so long that
> you'd rather not load them entirely at once. If you can bound the size of a
> match by k << length of a line, then you know that you can only keep a
> small portion of the line in memory to search the regexp. Typically you'd
> like to access substrings of size k from left to right. I guess such a
> thing should involve buffered inputs and avoid copying strings as much as
> possible. My question is as follows: has anybody written a library to
> access these substrings gracefully and with decent performance?
> Cheers,
>   Philippe.
[...]


I think the regexp itself also has an impact on
how fast and/or how easily this can be achieved.

The more complex the regexp, the more resources
you will need.

If you can boil your regexp down to something
easily parseable, the length of the lines might be of no importance.


Ciao,
   Oliver


* Re: Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 17:23   ` Francois Charles Matthieu Berenger
@ 2012-03-17 16:53     ` oliver
  2012-03-19  9:08     ` Philippe Veber
  1 sibling, 0 replies; 15+ messages in thread
From: oliver @ 2012-03-17 16:53 UTC (permalink / raw)
  To: Francois Charles Matthieu Berenger
  Cc: Edgar Friendly, Philippe Veber, caml-list


DNA:  C, G, A, T, \n



On Sat, Mar 17, 2012 at 02:23:48AM +0900, Francois Charles Matthieu Berenger wrote:
> hi philippe,
> 
> i am curious, is your string a dna sequence, and is that why it is so long?
> 
> regards,f
> 
> 
> On Fri, 16 Mar 2012 10:14:41 -0400
> Edgar Friendly <thelema314@gmail.com> wrote:
> 
> > On 03/16/2012 09:03 AM, Philippe Veber wrote:
> > > Dear camlers,
> > >
> > > Say that you'd like to search a regexp on a file with lines so long that
> > > you'd rather not load them entirely at once. If you can bound the size
> > > of a match by k << length of a line, then you know that you can only
> > > keep a small portion of the line in memory to search the regexp.
> > > Typically you'd like to access substrings of size k from left to right.
> > > I guess such a thing should involve buffered inputs and avoid copying
> > > strings as much as possible. My question is as follows: has anybody
> > > written a library to access these substrings gracefully and with decent
> > > performance?
> > > Cheers,
> > >    Philippe.
> > >
> > This is tricky to do, as incremental matching implies DFA-based 
> > matching, but returning matching substrings implies backtrack-based 
> > matching.  The re2 library returns matching substrings by matching 
> > forward to find the end of patterns, and then matching backwards on the 
> > reversed regex from that point to find their beginning.  I don't know if 
> > even it supports this on incremental input.
> > 
> > Subject to the assumption that starting to match implies either 
> > finishing or aborting a match (i.e. once you've started to match your 
> > regex, no other match will start before either this match attempt 
> > finishes successful or not), this is not unreasonable to do.  Without 
> > this assumption, it requires tracking many match start locations and 
> > somehow knowing which of them match or fail to match.  I'm not sure this 
> > has been done before.
> > 
> > E.
> > 
> 
> 
> 
> 


* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 14:49 ` Jérémie Dimino
@ 2012-03-18 21:11   ` Török Edwin
  2012-03-19  9:11     ` Philippe Veber
  0 siblings, 1 reply; 15+ messages in thread
From: Török Edwin @ 2012-03-18 21:11 UTC (permalink / raw)
  To: caml-list

On 03/16/2012 04:49 PM, Jérémie Dimino wrote:
> On Fri, 16 Mar 2012 14:03:38 +0100,
> Philippe Veber <philippe.veber@gmail.com> wrote:
> 
>> Say that you'd like to search a regexp on a file with lines so long
>> that you'd rather not load them entirely at once. If you can bound
>> the size of a match by k << length of a line, then you know that you
>> can only keep a small portion of the line in memory to search the
>> regexp. Typically you'd like to access substrings of size k from left
>> to right. I guess such a thing should involve buffered inputs and
>> avoid copying strings as much as possible. My question is as follows:
>> has anybody written a library to access these substrings gracefully
>> and with decent performance? Cheers,
> 
> You can use a non-backtracking regexp library to find offsets of the
> substrings, then seek in the file to extract them. You can use for
> example the libre library from Jérôme Vouillon [1]. It only accept
> strings as input but it would be really easy to make it work on input
> channels (just replace "s.[pos]" by "input_char ic").
> 
>   [1] http://sourceforge.net/projects/libre/
>       https://github.com/avsm/ocaml-re.git
> 

A nice library for regular expression matching is LibTRE (BSD licensed),
and it has a way to parse arbitrary data with callbacks:
http://laurikari.net/tre/documentation/reguexec/

According to the paper it is also good at finding substring matches
with its tagged NFA:
http://laurikari.net/ville/regex-submatch.pdf

If you don't use back-references (!tre_have_backrefs) then it guarantees linear-time matching.

I couldn't find an OCaml wrapper for it, but it should be simple enough to write one.

Best regards,
--Edwin


* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber
                   ` (2 preceding siblings ...)
  2012-03-16 20:11 ` oliver
@ 2012-03-18 23:56 ` oliver
  3 siblings, 0 replies; 15+ messages in thread
From: oliver @ 2012-03-18 23:56 UTC (permalink / raw)
  To: Philippe Veber; +Cc: caml users

On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote:
> Dear camlers,
> 
> Say that you'd like to search a regexp on a file with lines so long that
> you'd rather not load them entirely at once. If you can bound the size of a
> match by k << length of a line, then you know that you can only keep a
> small portion of the line in memory to search the regexp. Typically you'd
> like to access substrings of size k from left to right. I guess such a
> thing should involve buffered inputs and avoid copying strings as much as
> possible. My question is as follows: has anybody written a library to
> access these substrings gracefully and with decent performance?
> Cheers,
>   Philippe.

To your question about such a library: I don't know of one.
I wonder whether your lines would fill some GB of RAM...?!

Not sure if it matches your question, but if there is no such lib,
you may want to implement the regexp search yourself...?!

==> http://swtch.com/~rsc/regexp/regexp1.html


For fast input the Buffer module is a real performance boost,
compared to normal string-appending operations.

==> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Buffer.html
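
A tiny illustration of that point (the sizes and the test string are
arbitrary): accumulating with Buffer is linear in the total output length,
while repeated (^) re-copies everything accumulated so far at every step and
is quadratic overall.

  let with_buffer n =
    let b = Buffer.create 1024 in
    for _i = 1 to n do Buffer.add_string b "ACGT" done;
    Buffer.contents b

  let with_concat n =
    let s = ref "" in
    for _i = 1 to n do s := !s ^ "ACGT" done;   (* quadratic total copying *)
    !s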


Ciao,
   Oliver


* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 17:02     ` Edgar Friendly
@ 2012-03-19  9:08       ` Philippe Veber
  2012-03-19 13:44         ` Edgar Friendly
  0 siblings, 1 reply; 15+ messages in thread
From: Philippe Veber @ 2012-03-19  9:08 UTC (permalink / raw)
  To: Edgar Friendly; +Cc: caml-list


Thanks Edgar and Jérémie, this indeed seems to be the right track. I just
hope that a repeated use of input_char is not 10-100X slower than
input_line :o).
ph.
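
A throwaway way to check that worry on a real file (the file name comes from
the command line, and Sys.time is crude but enough for an order-of-magnitude
answer). Since in_channel reads are buffered, input_char mostly pays a
per-call overhead rather than a per-byte system call.

  let time label f =
    let t0 = Sys.time () in
    f ();
    Printf.printf "%s: %.3f s\n%!" label (Sys.time () -. t0)

  let read_with f path =
    let ic = open_in path in
    (try while true do ignore (f ic) done with End_of_file -> ());
    close_in ic

  let () =
    let path = Sys.argv.(1) in
    time "input_line" (fun () -> read_with input_line path);
    time "input_char" (fun () -> read_with input_char path)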

2012/3/16 Edgar Friendly <thelema314@gmail.com>

> So given a large file and a line number, you want to:
> 1) extract that line from the file
> 2) produce an enum of all k-length slices of that line?
> 3) match each slice against your regexp set to produce a list/enum of
> substrings that match the regexps?
> Without reading the whole line into memory at once.
>
> I'm with Dimino on the right solution - just use a matcher that works
> incrementally, feed it one byte at a time, and have it return a list of
> match offsets.  Then work backwards from these endpoints to figure out
> which substrings you want.
>
> There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
> should suffice to use (0,k-1) and (k,2k-1) with an incremental matching
> routine.
>
> E.
>
>
>
> On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber <philippe.veber@gmail.com
> > wrote:
>
>> Thank you Edgar for your answer (and also Christophe). It seems my
>> question was a bit misleading: actually I target a subset of regexps whose
>> matching is really trivial, so this is no worry for me. I was more
>> interested in how to access a large line in a file by chunks of fixed
>> length k. For instance how to build a [Substring.t Enum.t] from some line
>> in a file, without building the whole line in memory. This enum would yield
>> the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many
>> string copy/concat operations. I think I can do it myself but I'm not too
>> confident regarding good practices on buffered reads of files. Maybe there
>> are some good examples in Batteries?
>>
>> Thanks again,
>>   ph.
>>
>>
>>
>



* Re: Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-16 17:23   ` Francois Charles Matthieu Berenger
  2012-03-17 16:53     ` oliver
@ 2012-03-19  9:08     ` Philippe Veber
  1 sibling, 0 replies; 15+ messages in thread
From: Philippe Veber @ 2012-03-19  9:08 UTC (permalink / raw)
  To: Francois Charles Matthieu Berenger; +Cc: Edgar Friendly, caml-list


Yes indeed!

2012/3/16 Francois Charles Matthieu Berenger <Berenger@riken.jp>

> hi philippe,
>
> i am curious, is your string a dna sequence, and is that why it is so long?
>
> regards,f
>
>
> On Fri, 16 Mar 2012 10:14:41 -0400
> Edgar Friendly <thelema314@gmail.com> wrote:
>
> > On 03/16/2012 09:03 AM, Philippe Veber wrote:
> > > Dear camlers,
> > >
> > > Say that you'd like to search a regexp on a file with lines so long
> that
> > > you'd rather not load them entirely at once. If you can bound the size
> > > of a match by k << length of a line, then you know that you can only
> > > keep a small portion of the line in memory to search the regexp.
> > > Typically you'd like to access substrings of size k from left to right.
> > > I guess such a thing should involve buffered inputs and avoid copying
> > > strings as much as possible. My question is as follows: has anybody
> > > written a library to access these substrings gracefully and with decent
> > > performance?
> > > Cheers,
> > >    Philippe.
> > >
> > This is tricky to do, as incremental matching implies DFA-based
> > matching, but returning matching substrings implies backtrack-based
> > matching.  The re2 library returns matching substrings by matching
> > forward to find the end of patterns, and then matching backwards on the
> > reversed regex from that point to find their beginning.  I don't know if
> > even it supports this on incremental input.
> >
> > Subject to the assumption that starting to match implies either
> > finishing or aborting a match (i.e. once you've started to match your
> > regex, no other match will start before either this match attempt
> > finishes successful or not), this is not unreasonable to do.  Without
> > this assumption, it requires tracking many match start locations and
> > somehow knowing which of them match or fail to match.  I'm not sure this
> > has been done before.
> >
> > E.
> >
>
>
>
>



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-18 21:11   ` Török Edwin
@ 2012-03-19  9:11     ` Philippe Veber
  0 siblings, 0 replies; 15+ messages in thread
From: Philippe Veber @ 2012-03-19  9:11 UTC (permalink / raw)
  To: Török Edwin; +Cc: caml-list


Thanks Edwin and Oliver, I wasn't aware of these libraries. I'll definitely
have a look.
Thanks again everybody for your help on this!
ph.

2012/3/18 Török Edwin <edwin+ml-ocaml@etorok.net>

> On 03/16/2012 04:49 PM, Jérémie Dimino wrote:
> > On Fri, 16 Mar 2012 14:03:38 +0100,
> > Philippe Veber <philippe.veber@gmail.com> wrote:
> >
> >> Say that you'd like to search a regexp on a file with lines so long
> >> that you'd rather not load them entirely at once. If you can bound
> >> the size of a match by k << length of a line, then you know that you
> >> can only keep a small portion of the line in memory to search the
> >> regexp. Typically you'd like to access substrings of size k from left
> >> to right. I guess such a thing should involve buffered inputs and
> >> avoid copying strings as much as possible. My question is as follows:
> >> has anybody written a library to access these substrings gracefully
> >> and with decent performance? Cheers,
> >
> > You can use a non-backtracking regexp library to find offsets of the
> > substrings, then seek in the file to extract them. You can use for
> > example the libre library from Jérôme Vouillon [1]. It only accepts
> > strings as input but it would be really easy to make it work on input
> > channels (just replace "s.[pos]" by "input_char ic").
> >
> >   [1] http://sourceforge.net/projects/libre/
> >       https://github.com/avsm/ocaml-re.git
> >
>
> A nice library for regular expression matching is LibTRE (BSD licensed),
> and it has a way to parse arbitrary data with callbacks:
> http://laurikari.net/tre/documentation/reguexec/
>
> According to the paper it is also good at finding substring matches
> with its tagged NFA:
> http://laurikari.net/ville/regex-submatch.pdf
>
> If you don't use back-references (!tre_have_backrefs) then it guarantees
> linear-time matching.
>
> I couldn't find an OCaml wrapper for it, but it should be simple enough to
> write one.
>
> Best regards,
> --Edwin
>
>



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-19  9:08       ` Philippe Veber
@ 2012-03-19 13:44         ` Edgar Friendly
  2012-03-21  7:21           ` Philippe Veber
  0 siblings, 1 reply; 15+ messages in thread
From: Edgar Friendly @ 2012-03-19 13:44 UTC (permalink / raw)
  To: Philippe Veber; +Cc: caml-list

On 03/19/2012 05:08 AM, Philippe Veber wrote:
> Thanks Edgar and Jérémie, this indeed seems to be the right track. I
> just hope that a repeated use of input_char is not 10-100X slower than
> input_line :o).
> ph.
>
Quite true - instead of giving the matcher just a single byte at a time, 
it is more efficient to give it blocks of data, as long as it can keep 
its state from one block to the next.  But its matching internally will 
be on one byte at a time, normally.  I guess with DNA, because of the 
reduced character set, it'd be possible to get each symbol down to 2 
bits (if you're really just using ACGT), in which case, the matcher 
could run 4 basepairs at a time, but there are a lot of corner issues 
doing things that way.  A lot depends on how much time and effort you're 
willing to spend engineering something.

E.
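
A sketch of that block-at-a-time feeding, with the same step-function shape as
the sketch earlier in the thread: the matcher state is threaded across block
boundaries, so nothing ever has to be resident in full. block_size is
arbitrary, and line boundaries are ignored here, the input being treated as
one long stream.

  (* step : 'st -> char -> 'st * bool, returning true when a match ends. *)
  let match_ends_blocks ?(block_size = 65536) step init ic =
    let buf = Bytes.create block_size in
    let rec loop st base acc =
      match input ic buf 0 block_size with
      | 0 -> List.rev acc                       (* end of file *)
      | n ->
        let st = ref st and acc = ref acc in
        for i = 0 to n - 1 do
          let st', hit = step !st (Bytes.get buf i) in
          st := st';
          if hit then acc := (base + i) :: !acc
        done;
        loop !st (base + n) !acc
    in
    loop init 0 []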

> 2012/3/16 Edgar Friendly <thelema314@gmail.com
> <mailto:thelema314@gmail.com>>
>
>     So given a large file and a line number, you want to:
>     1) extract that line from the file
>     2) produce an enum of all k-length slices of that line?
>     3) match each slice against your regexp set to produce a list/enum
>     of substrings that match the regexps?
>     Without reading the whole line into memory at once.
>
>     I'm with Dimino on the right solution - just use a matcher that
>     works incrementally, feed it one byte at a time, and have it return
>     a list of match offsets.  Then work backwards from these endpoints
>     to figure out which substrings you want.
>
>     There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>     should suffice to use (0,k-1) and (k,2k-1) with an incremental
>     matching routine.
>
>     E.
>
>
>
>     On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>     <philippe.veber@gmail.com <mailto:philippe.veber@gmail.com>> wrote:
>
>         Thank you Edgar for your answer (and also Christophe). It seems
>         my question was a bit misleading: actually I target a subset of
>         regexps whose matching is really trivial, so this is no worry
>         for me. I was more interested in how to access a large line in a
>         file by chunks of fixed length k. For instance how to build a
>         [Substring.t Enum.t] from some line in a file, without building
>         the whole line in memory. This enum would yield the substrings
>         (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>         copy/concat operations. I think I can do it myself but I'm not
>         too confident regarding good practices on buffered reads of
>         files. Maybe there are some good examples in Batteries?
>
>         Thanks again,
>            ph.
>
>
>
>



* Re: [Caml-list] Efficient scanning of large strings from files
  2012-03-19 13:44         ` Edgar Friendly
@ 2012-03-21  7:21           ` Philippe Veber
  0 siblings, 0 replies; 15+ messages in thread
From: Philippe Veber @ 2012-03-21  7:21 UTC (permalink / raw)
  To: Edgar Friendly; +Cc: caml-list


2012/3/19 Edgar Friendly <thelema314@gmail.com>

> On 03/19/2012 05:08 AM, Philippe Veber wrote:
>
>> Thanks Edgar and Jérémie, this indeed seems to be the right track. I
>> just hope that a repeated use of input_char is not 10-100X slower than
>> input_line :o).
>> ph.
>>
> Quite true - instead of giving the matcher just a single byte at a time,
> it is more efficient to give it blocks of data, as long as it can keep its
> state from one block to the next.  But its matching internally will be on
> one byte at a time, normally.

Thanks for the confirmation, I now see more clearly what to do.


> I guess with DNA, because of the reduced character set, it'd be possible
> to get each symbol down to 2 bits (if you're really just using ACGT), in
> which case, the matcher could run 4 basepairs at a time, but there are a lot
> of corner issues doing things that way.  A lot depends on how much time and
> effort you're willing to spend engineering something.
>
Maybe not that far yet, but this is something we've mentioned for biocaml.
I guess I could take some inspiration from the bitset module in Batteries.
Anyway thanks everybody for your help!
ph.
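
For reference, a rough sketch of the 2-bit packing idea (A=0, C=1, G=2, T=3,
four bases per byte; both the encoding and the error handling are arbitrary
choices here):

  let code = function
    | 'A' | 'a' -> 0 | 'C' | 'c' -> 1
    | 'G' | 'g' -> 2 | 'T' | 't' -> 3
    | c -> invalid_arg (Printf.sprintf "not a nucleotide: %c" c)

  (* Pack a DNA string into (n + 3) / 4 bytes. *)
  let pack s =
    let n = String.length s in
    let packed = Bytes.make ((n + 3) / 4) '\000' in
    String.iteri
      (fun i c ->
         let b = Char.code (Bytes.get packed (i / 4)) in
         Bytes.set packed (i / 4)
           (Char.chr (b lor (code c lsl (2 * (i mod 4))))))
      s;
    packed

  (* 2-bit code of base number [i]. *)
  let get packed i =
    (Char.code (Bytes.get packed (i / 4)) lsr (2 * (i mod 4))) land 3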


>
> E.
>
>  2012/3/16 Edgar Friendly <thelema314@gmail.com
>> <mailto:thelema314@gmail.com>>
>>
>>
>>    So given a large file and a line number, you want to:
>>    1) extract that line from the file
>>    2) produce an enum of all k-length slices of that line?
>>    3) match each slice against your regexp set to produce a list/enum
>>    of substrings that match the regexps?
>>    Without reading the whole line into memory at once.
>>
>>    I'm with Dimino on the right solution - just use a matcher that
>>    works incrementally, feed it one byte at a time, and have it return
>>    a list of match offsets.  Then work backwards from these endpoints
>>    to figure out which substrings you want.
>>
>>    There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>>    should suffice to use (0,k-1) and (k,2k-1) with an incremental
>>    matching routine.
>>
>>    E.
>>
>>
>>
>>    On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>>    <philippe.veber@gmail.com <mailto:philippe.veber@gmail.com>>
>> wrote:
>>
>>        Thank you Edgar for your answer (and also Christophe). It seems
>>        my question was a bit misleading: actually I target a subset of
>>        regexps whose matching is really trivial, so this is no worry
>>        for me. I was more interested in how to access a large line in a
>>        file by chunks of fixed length k. For instance how to build a
>>        [Substring.t Enum.t] from some line in a file, without building
>>        the whole line in memory. This enum would yield the substrings
>>        (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>>        copy/concat operations. I think I can do it myself but I'm not
>>        too confident regarding good practices on buffered reads of
>>        files. Maybe there are some good examples in Batteries?
>>
>>        Thanks again,
>>           ph.
>>
>>
>>
>>
>>
>



Thread overview: 15+ messages
2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber
2012-03-16 14:14 ` Edgar Friendly
2012-03-16 14:48   ` Philippe Veber
2012-03-16 17:02     ` Edgar Friendly
2012-03-19  9:08       ` Philippe Veber
2012-03-19 13:44         ` Edgar Friendly
2012-03-21  7:21           ` Philippe Veber
2012-03-16 17:23   ` Francois Charles Matthieu Berenger
2012-03-17 16:53     ` oliver
2012-03-19  9:08     ` Philippe Veber
2012-03-16 14:49 ` Jérémie Dimino
2012-03-18 21:11   ` Török Edwin
2012-03-19  9:11     ` Philippe Veber
2012-03-16 20:11 ` oliver
2012-03-18 23:56 ` oliver
