caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Subsequence references or substrings in OCaml
@ 1998-12-18 19:27 Brian Rogoff
  1998-12-21 10:39 ` Christian Lindig
  1998-12-21 15:49 ` Michael Hicks
  0 siblings, 2 replies; 4+ messages in thread
From: Brian Rogoff @ 1998-12-18 19:27 UTC (permalink / raw)
  To: caml-list

Hi,

I've been doing a lot of string processing lately and trying to determine
the correct "functional" approach to string processing. In SML '97 (hey,
you've got to look at the competition ;-) the Basis Library has a
signature SUBSTRING which has a substring type, whose values are just 
triples (s,i,n) where i is the starting index and n the size of the string
s. It turns out that many algorithms are simplified by using this kind of 
representation for strings. 

It also turns out that similar string representations have been used
before; if you hop over to http://www.cs.cmu.edu/~wjh/papers/ you'll see
that "subsequence references" are pretty much the same thing. 

Its simple enough to implement subsequence references as a user defined
type in OCaml, as I've done, but I am curious about whether anyone else
who has used similar libraries would find a built-in substring or 
subsequence ref library useful. 

-- Brian





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Subsequence references or substrings in OCaml
  1998-12-18 19:27 Subsequence references or substrings in OCaml Brian Rogoff
@ 1998-12-21 10:39 ` Christian Lindig
  1998-12-21 15:49 ` Michael Hicks
  1 sibling, 0 replies; 4+ messages in thread
From: Christian Lindig @ 1998-12-21 10:39 UTC (permalink / raw)
  To: Brian Rogoff; +Cc: Caml Mailing List

> Its simple enough to implement subsequence references as a user defined
> type in OCaml, as I've done, but I am curious about whether anyone else
> who has used similar libraries would find a built-in substring or 
> subsequence ref library useful. 

[sorry, no french version]

String processing includes scanning and splitting strings and is often
done using regular expressions.  Inspired by parser combinators I have
written a light weight library of `lexer combinators' which uses an
efficient internal representation of substrings:  a substring is a
(int * int) pair.  The first element denotes the index of the first
character and the second the length of the substring.  The interface
of the module is attached - mail me for the code.

-- Christian

-- 
Christian Lindig    Technische Universitaet Braunschweig, Germany
                    http://www.cs.tu-bs.de/softech/people/lindig
                    mailto:lindig@ips.cs.tu-bs.de

(*  ------------------------------------------------------------------ 

    $Id: lc.mli,v 1.3 1998/11/24 19:33:17 lindig Exp $
    
    This module provides lexer combinators (LCs). They can be used for
    similar purposes as regular expressions: verifying that a string
    obeys a specified syntax and extracting parts from it.  LCs are
    not as powerful as regular expression (REs) but do not require
    preprocessing like REs.
  
    The implementation aims to be reasonable efficient by using
    indices into the scanned string during the whole scan. However,
    scanning of long strings (> 1000 chars) should be avoided.       
    ------------------------------------------------------------------ *)
   
exception Error of string
(* [Error] reports the failure of a lexer with a message which is not
   very descriptive but may help finding unexpected failures *)

type region = int * int
(* While a string is scanned so-called regions (substrings) from it
   can be saved to extract them later.  A region consists of the index
   of its first letter and its length; the index of a string's first
   character is 0:

        The region (5,3) of "Objective Caml" is "tiv".
                             012345
                                  123 
                                  
   A region is always relative to a string *)

type lexer

(* A lexer scans a string to check that it matches a syntactic
   structure. It fails when the string does not match the
   expected syntactic structure *)

val succeed : lexer
(* [succeed] is a lexer that always succeeds. *) 

val fail : string -> 'a
(* [fail] always fails with a descriptve error message *)

val any : lexer
(* [any] consumes just one character when availabe and fails when the
   end of string is reached *)

val eof : lexer
(* [eof] succeeds when the end of input (i.e.  string) is reached.  It
   does not consume any character *)

val satisfy : (char -> bool) -> lexer
(* [satisfy f] consumes the next characters successfully when it
   satisfies predicate f *)

val chr : char -> lexer
(* [chr c] consumes the next character when it equals [c]; fails
   otherwise *)

val str : string -> lexer
(* [str s] succeeds when the input starts with string [s] - in this case
   it consumes as many characters as [s] is long. *)

val ( *** ) : lexer -> lexer -> lexer
(* [x *** y] is a lexer that first uses lexer [x] and then scans the
   remaining input using lexer [y]. It fails when any of [x] and [y] fail
   *)

val ( ||| ) : lexer -> lexer -> lexer
(* [x ||| y] scans the input using lexer [x]; if [x] fails it the
   input again using [y]. The lexer fails when [y] fails. *)

val many : lexer -> lexer
(* [many l] scans the input using lexer [l] as many (including zero) times as
   possible; always succeeds. Beware: [many] is greedy; i.e. [many
   any *** chr 'x'] will always fail because the `x' will be consumed
   by [many any]. *)

val some : lexer -> lexer
(* [some l] scans the input using lexer one or more times. [some] is
   greedy -- see also [many] *)

val opt : lexer -> lexer
(* [opt l] scans the input using lexer [l] when possible; when [l]
   fails the lexer does not consume input and succeeds. *)

val save : lexer -> lexer
(* [save l] scans the input using lexer l and stores the region l
   successfully scanned into the region list. It can later be used to
   extract the substring lexer [l] has sucessfully scanned. The lexer fails
   whenever [l] fails. The resulting region list is sorted by the start
   points of its regions: 

   Given a string where the following 4 regions are saved (regions can
   be nested):
        
       (1  (2   2)  (3  (4 4) 3) 1)

   In the resulting region list the order of regions will be: 
   
        1 2 3 4

   In the example above regions are marked by parentheses. Of couse
   they are still represented as (index, lengh) pairs as described
   above. 

*)

val extract : region list -> string -> string list
(* [extract l str] extracts all substrings from the region list [l] from
   string [str]. *)

val scan: string -> lexer -> (int * region list)
(* [scan str l] scans string [str] using lexer [l]. The result is a
   pair: the number of characters successfully consumed and the list of
   stored regions. [scan] raises [Error] when scanning fails. The regions
   from the region list can be passed to [extract] for extracting all
   regions as substrings from [str] *)

val scanAndExtract : string -> lexer -> string list
(* [scanAndExtract] scans using [lexer]. On successfull scan the list
   of saved substrings (see [save]) is returned. [Error] is raised when
   the string does not match *)
                   




^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Subsequence references or substrings in OCaml
  1998-12-18 19:27 Subsequence references or substrings in OCaml Brian Rogoff
  1998-12-21 10:39 ` Christian Lindig
@ 1998-12-21 15:49 ` Michael Hicks
  1998-12-27 21:03   ` Mark Hayden
  1 sibling, 1 reply; 4+ messages in thread
From: Michael Hicks @ 1998-12-21 15:49 UTC (permalink / raw)
  To: Brian Rogoff, caml-list

At 11:27 AM -0800 12/18/98, Brian Rogoff wrote:
>Its simple enough to implement subsequence references as a user defined
>type in OCaml, as I've done, but I am curious about whether anyone else
>who has used similar libraries would find a built-in substring or
>subsequence ref library useful.

We've done this in our active network implementation to efficiently
implement packet switching.  The idea is that when the packet comes in, it
has an Ethernet header that is stripped off before the packet is processed
by the next level in the protocol stack.  After doing this, it may turn out
that the packet should be forwarded, in which case a new Ethernet header
needs to be prepended to the currently headerless packet.  By using the
substring abstraction, we are able to the prepend efficiently because the
space of the old header may be overwritten.  The alternative would have us
allocate a new buffer with the correct header and then copy the old payload
into it.  Using substrings is more efficient: there is no wasted copying of
data; this is a time-savings in itself, but it also reduces the frequency
of GC.

Mike

------------------------------------------------------------------------------
Michael Hicks
Ph.D. candidate, the University of Pennsylvania
mwh@gradient.cis.upenn.edu
"I'm looking over
 A three-leaf clover,
 That I overlooked be-three ..." -- Bugs Bunny





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Subsequence references or substrings in OCaml
  1998-12-21 15:49 ` Michael Hicks
@ 1998-12-27 21:03   ` Mark Hayden
  0 siblings, 0 replies; 4+ messages in thread
From: Mark Hayden @ 1998-12-27 21:03 UTC (permalink / raw)
  To: Michael Hicks; +Cc: Brian Rogoff, caml-list

You'll also find support for management of sub-strings
in the Ensemble group communication toolkit written in Ocaml.
Most of the data that moves through our protocols is
message data contained in 'substring' data structures
and so their implementation turns out to heavily impact the
performance of the Ensemble protocols.  The substring
implementation is described in a section of the paper
"Distributed Communication in ML", which can be found at

  http://simon.cs.cornell.edu/Info/People/hayden/publications.htm

The "substring" module are in the buffer subdirectory
of the Ensemble distribution.

--Mark

Michael Hicks wrote:
> 
> At 11:27 AM -0800 12/18/98, Brian Rogoff wrote:
> >Its simple enough to implement subsequence references as a user defined
> >type in OCaml, as I've done, but I am curious about whether anyone else
> >who has used similar libraries would find a built-in substring or
> >subsequence ref library useful.
> 
> We've done this in our active network implementation to efficiently
> implement packet switching.  The idea is that when the packet comes in, it
> has an Ethernet header that is stripped off before the packet is processed
> by the next level in the protocol stack.  After doing this, it may turn out
> that the packet should be forwarded, in which case a new Ethernet header
> needs to be prepended to the currently headerless packet.  By using the
> substring abstraction, we are able to the prepend efficiently because the
> space of the old header may be overwritten.  The alternative would have us
> allocate a new buffer with the correct header and then copy the old payload
> into it.  Using substrings is more efficient: there is no wasted copying of
> data; this is a time-savings in itself, but it also reduces the frequency
> of GC.
> 
> Mike
> 
> ------------------------------------------------------------------------------
> Michael Hicks
> Ph.D. candidate, the University of Pennsylvania
> mwh@gradient.cis.upenn.edu
> "I'm looking over
>  A three-leaf clover,
>  That I overlooked be-three ..." -- Bugs Bunny


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~1998-12-29 13:49 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-12-18 19:27 Subsequence references or substrings in OCaml Brian Rogoff
1998-12-21 10:39 ` Christian Lindig
1998-12-21 15:49 ` Michael Hicks
1998-12-27 21:03   ` Mark Hayden

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).