Thanks Edgar and Jérémie, this indeed seems to be the right track. I just hope that repeated calls to input_char are not 10-100x slower than input_line :o).
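
[Editor's note: in_channel reads are already buffered in OCaml, so input_char mostly costs one function call per byte rather than one system call. If even that overhead turns out to matter, a minimal sketch of the usual workaround, reading large blocks with the standard input function into a scratch buffer; the 64 KiB chunk size and the name fold_bytes are only illustrative:

  (* Fold over the bytes of a channel, reading in large blocks to
     amortize the per-call overhead of input_char. *)
  let fold_bytes f acc ic =
    let chunk = Bytes.create 65536 in
    let rec loop acc =
      let n = input ic chunk 0 (Bytes.length chunk) in
      if n = 0 then acc                (* input returns 0 only at EOF *)
      else begin
        let acc = ref acc in
        for i = 0 to n - 1 do
          acc := f !acc (Bytes.get chunk i)
        done;
        loop !acc
      end
    in
    loop acc

For example, fold_bytes (fun n _ -> n + 1) 0 ic counts the bytes of ic without ever holding more than one chunk in memory.]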
ph.

2012/3/16 Edgar Friendly <thelema314@gmail.com>
So given a large file and a line number, you want to:
1) extract that line from the file
2) produce an enum of all k-length slices of that line?
3) match each slice against your regexp set to produce a list/enum of substrings that match the regexps?
Without reading the whole line into memory at once. 

I'm with Dimino on the right solution: just use a matcher that works incrementally, feed it one byte at a time, and have it return a list of match offsets. Then work backwards from these endpoints to figure out which substrings you want.
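
[Editor's note: for concreteness, a minimal sketch of what such an incremental matcher could look like, assuming a DFA built with search semantics (it enters an accepting state whenever a match ends); the dfa type and match_offsets are hypothetical names, not from any particular library, and the exception pattern needs OCaml >= 4.02:

  type dfa = {
    start : int;                    (* initial state *)
    step : int -> char -> int;      (* transition function *)
    accepting : int -> bool;        (* does a match end in this state? *)
  }

  (* Feed the automaton one byte at a time, stopping at end of line,
     and record the offset just past each position where a match ends. *)
  let match_offsets dfa ic =
    let rec loop state pos acc =
      match input_char ic with
      | exception End_of_file -> List.rev acc
      | '\n' -> List.rev acc
      | c ->
        let state = dfa.step state c in
        let acc = if dfa.accepting state then (pos + 1) :: acc else acc in
        loop state (pos + 1) acc
    in
    loop dfa.start 0 []

Only the current state and the offset list are kept in memory, never the line itself.]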

There shouldn't be any reason to use the overlapping substrings (0,k-1) and (1,k) - with an incremental matching routine, the disjoint chunks (0,k-1) and (k,2k-1) suffice.

E.



On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber <philippe.veber@gmail.com> wrote:
Thank you Edgar for your answer (and also Christophe). It seems my question was a bit misleading: actually I target a subset of regexps whose matching is really trivial, so that part is no worry for me. I was more interested in how to access a large line in a file by chunks of fixed length k. For instance, how to build a [Substring.t Enum.t] from some line in a file without building the whole line in memory. This enum would yield the substrings (0,k-1), (1,k), (2,k+1), etc., without doing too many string copy/concat operations. I think I can do it myself, but I'm not too confident regarding good practices on buffered reads of files. Maybe there are some good examples in Batteries?
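
[Editor's note: a minimal sketch of such an enum, assuming Batteries' BatEnum and OCaml >= 4.02. It yields fresh strings rather than Substring.t values, since a Substring.t keeps its whole base string alive in memory, which is exactly what we want to avoid; each step copies only the k-byte window, never the whole line. The name windows is made up for illustration:

  (* Enumerate the k-length sliding windows of the current line of
     [ic]: (0,k-1), (1,k), (2,k+1), ...  The window lives in a
     k-byte buffer that is shifted left by one byte per step. *)
  let windows k ic =
    let buf = Bytes.create k in
    let pos = ref 0 in
    (try
       while !pos < k do
         (match input_char ic with
          | '\n' -> raise Exit
          | c -> Bytes.set buf !pos c);
         incr pos
       done
     with Exit | End_of_file -> ());
    if !pos < k then BatEnum.empty ()   (* line shorter than k *)
    else begin
      let first = ref true in
      BatEnum.from (fun () ->
        if !first then begin first := false; Bytes.to_string buf end
        else
          match input_char ic with
          | exception End_of_file -> raise BatEnum.No_more_elements
          | '\n' -> raise BatEnum.No_more_elements
          | c ->
            Bytes.blit buf 1 buf 0 (k - 1);   (* slide left by one *)
            Bytes.set buf (k - 1) c;
            Bytes.to_string buf)
    end

One can then BatEnum.map the matcher over the result; memory use stays at O(k) regardless of line length.]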

Thanks again,
  ph.