From: Philippe Veber <philippe.veber@gmail.com>
Date: Mon, 19 Mar 2012 10:08:17 +0100
To: Edgar Friendly <thelema314@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Efficient scanning of large strings from files

Thanks Edgar and Jérémie, this indeed seems to be the right track. I just hope that repeated use of input_char is not 10-100X slower than input_line :o).
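(A quick way to settle that worry would be a micro-benchmark along these lines. This is a minimal sketch, not a measurement: it assumes the file to scan is given on the command line and needs the unix library for Unix.gettimeofday. Since stdlib channels are buffered, input_char should cost a function call per byte, not a system call.)

    (* Minimal sketch: time byte-wise vs line-wise traversal of the
       same file. Requires the unix library. *)
    let time label f =
      let t0 = Unix.gettimeofday () in
      f ();
      Printf.printf "%s: %.3fs\n%!" label (Unix.gettimeofday () -. t0)

    let by_char path =
      let ic = open_in path in
      (try while true do ignore (input_char ic) done
       with End_of_file -> ());
      close_in ic

    let by_line path =
      let ic = open_in path in
      (try while true do ignore (input_line ic) done
       with End_of_file -> ());
      close_in ic

    let () =
      time "input_char" (fun () -> by_char Sys.argv.(1));
      time "input_line" (fun () -> by_line Sys.argv.(1))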
ph.
2012/3/16 Edgar Friendly <thelema314@gmail.com>:

> So given a large file and a line number, you want to:
> 1) extract that line from the file
> 2) produce an enum of all k-length slices of that line?
> 3) match each slice against your regexp set to produce a list/enum of
> substrings that match the regexps?
> Without reading the whole line into memory at once.
>
> I'm with Dimino on the right solution - just use a matcher that works
> incrementally, feed it one byte at a time, and have it return a list of
> match offsets. Then work backwards from these endpoints to figure out
> which substrings you want.
>
> There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
> should suffice to use (0,k-1) and (k,2k-1) with an incremental matching
> routine.
>
> E.
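To make that concrete, the byte-at-a-time loop could look like the sketch below, with a fixed-string matcher standing in for the regexp set. match_ends is a made-up name, not an existing API, and a real incremental regexp engine would keep an automaton state rather than a buffer of the last m bytes:

    (* Sketch: scan a channel byte by byte and return the end offsets
       of every occurrence of a fixed pattern. Memory use is O(m) for
       a pattern of length m, independent of line length. *)
    let match_ends pattern ic =
      let m = String.length pattern in
      let buf = Bytes.make m '\000' in  (* circular buffer: last m bytes *)
      let ends = ref [] in
      let pos = ref 0 in                (* bytes read so far *)
      (try
         while true do
           Bytes.set buf (!pos mod m) (input_char ic);
           incr pos;
           if !pos >= m then begin
             (* oldest byte sits at index !pos mod m; compare in order *)
             let ok = ref true in
             for i = 0 to m - 1 do
               if Bytes.get buf ((!pos + i) mod m) <> pattern.[i]
               then ok := false
             done;
             if !ok then ends := (!pos - 1) :: !ends
           end
         done
       with End_of_file -> ());
      List.rev !ends

Because the matcher itself carries its state (here, the last m bytes) across any boundary, feeding it the non-overlapping chunks (0,k-1), (k,2k-1), ... indeed loses nothing.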
> On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
> <philippe.veber@gmail.com> wrote:
>> Thank you Edgar for your answer (and also Christophe). It seems my
>> question was a bit misleading: actually I target a subset of regexps
>> whose matching is really trivial, so this is no worry for me. I was
>> more interested in how to access a large line in a file by chunks of
>> fixed length k. For instance, how to build a [Substring.t Enum.t] from
>> some line in a file, without building the whole line in memory. This
>> enum would yield the substrings (0,k-1), (1,k), (2,k+1), etc., without
>> doing too many string copy/concat operations. I think I can do it
>> myself, but I'm not too confident regarding good practices on buffered
>> reads of files. Maybe there are some good examples in Batteries?
>>
>> Thanks again,
>> ph.
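For the windowing itself, here is a minimal sketch of what such an enum could look like, with the standard string Seq.t standing in for Batteries' [Substring.t Enum.t] (windows is a hypothetical function, not something from Batteries). It keeps only k bytes in a circular buffer and copies each window out as a fresh string when the sequence is forced:

    (* Sketch: the k-length windows of the first line of a channel as
       a lazy, single-pass sequence: (0,k-1), (1,k), (2,k+1), ...
       Only k bytes stay in memory; reading stops at the first
       newline or at end of file. *)
    let windows k ic : string Seq.t =
      let buf = Bytes.create k in
      (* read the initial window; false if the line is shorter than k *)
      let rec fill i =
        if i = k then true
        else match input_char ic with
          | '\n' -> false
          | c -> Bytes.set buf i c; fill (i + 1)
          | exception End_of_file -> false
      in
      (* the current window as a string; pos = index of its oldest byte *)
      let window pos =
        Bytes.sub_string buf pos (k - pos) ^ Bytes.sub_string buf 0 pos
      in
      let rec next pos () =
        let w = window pos in
        match input_char ic with
        | exception End_of_file -> Seq.Cons (w, Seq.empty)
        | '\n' -> Seq.Cons (w, Seq.empty)
        | c ->
            Bytes.set buf pos c;
            Seq.Cons (w, next ((pos + 1) mod k))
      in
      if fill 0 then next 0 else Seq.empty

The one copy per window (Bytes.sub_string plus the concatenation) seems hard to avoid if the windows may outlive the buffer; Substring-style views over a shared buffer would save the copy, but they would break as soon as the buffer is overwritten by the next read.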