From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id SAA19348 for caml-redistribution; Mon, 21 Dec 1998 18:07:42 +0100 (MET) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id LAA03858 for ; Mon, 21 Dec 1998 11:39:25 +0100 (MET) Received: from infbssys.ips.cs.tu-bs.de (infbssys.ips.cs.tu-bs.de [134.169.32.1]) by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id LAA18431 for ; Mon, 21 Dec 1998 11:39:19 +0100 (MET) Received: from infbsstq.ips.cs.tu-bs.de (lindig@infbsstq.ips.cs.tu-bs.de [134.169.32.78]) by infbssys.ips.cs.tu-bs.de (8.9.1/8.9.1) with ESMTP id LAA23119; Mon, 21 Dec 1998 11:39:18 +0100 Message-ID: <19981221113917.B3555@ips.cs.tu-bs.de> Date: Mon, 21 Dec 1998 11:39:17 +0100 From: Christian Lindig To: Brian Rogoff Cc: Caml Mailing List Subject: Re: Subsequence references or substrings in OCaml Mail-Followup-To: Brian Rogoff , Caml Mailing List References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93i In-Reply-To: ; from Brian Rogoff on Fri, Dec 18, 1998 at 11:27:57AM -0800 Sender: weis > Its simple enough to implement subsequence references as a user defined > type in OCaml, as I've done, but I am curious about whether anyone else > who has used similar libraries would find a built-in substring or > subsequence ref library useful. [sorry, no french version] String processing includes scanning and splitting strings and is often done using regular expressions. Inspired by parser combinators I have written a light weight library of `lexer combinators' which uses an efficient internal representation of substrings: a substring is a (int * int) pair. The first element denotes the index of the first character and the second the length of the substring. The interface of the module is attached - mail me for the code. -- Christian -- Christian Lindig Technische Universitaet Braunschweig, Germany http://www.cs.tu-bs.de/softech/people/lindig mailto:lindig@ips.cs.tu-bs.de (* ------------------------------------------------------------------ $Id: lc.mli,v 1.3 1998/11/24 19:33:17 lindig Exp $ This module provides lexer combinators (LCs). They can be used for similar purposes as regular expressions: verifying that a string obeys a specified syntax and extracting parts from it. LCs are not as powerful as regular expression (REs) but do not require preprocessing like REs. The implementation aims to be reasonable efficient by using indices into the scanned string during the whole scan. However, scanning of long strings (> 1000 chars) should be avoided. ------------------------------------------------------------------ *) exception Error of string (* [Error] reports the failure of a lexer with a message which is not very descriptive but may help finding unexpected failures *) type region = int * int (* While a string is scanned so-called regions (substrings) from it can be saved to extract them later. A region consists of the index of its first letter and its length; the index of a string's first character is 0: The region (5,3) of "Objective Caml" is "tiv". 012345 123 A region is always relative to a string *) type lexer (* A lexer scans a string to check that it matches a syntactic structure. It fails when the string does not match the expected syntactic structure *) val succeed : lexer (* [succeed] is a lexer that always succeeds. *) val fail : string -> 'a (* [fail] always fails with a descriptve error message *) val any : lexer (* [any] consumes just one character when availabe and fails when the end of string is reached *) val eof : lexer (* [eof] succeeds when the end of input (i.e. string) is reached. It does not consume any character *) val satisfy : (char -> bool) -> lexer (* [satisfy f] consumes the next characters successfully when it satisfies predicate f *) val chr : char -> lexer (* [chr c] consumes the next character when it equals [c]; fails otherwise *) val str : string -> lexer (* [str s] succeeds when the input starts with string [s] - in this case it consumes as many characters as [s] is long. *) val ( *** ) : lexer -> lexer -> lexer (* [x *** y] is a lexer that first uses lexer [x] and then scans the remaining input using lexer [y]. It fails when any of [x] and [y] fail *) val ( ||| ) : lexer -> lexer -> lexer (* [x ||| y] scans the input using lexer [x]; if [x] fails it the input again using [y]. The lexer fails when [y] fails. *) val many : lexer -> lexer (* [many l] scans the input using lexer [l] as many (including zero) times as possible; always succeeds. Beware: [many] is greedy; i.e. [many any *** chr 'x'] will always fail because the `x' will be consumed by [many any]. *) val some : lexer -> lexer (* [some l] scans the input using lexer one or more times. [some] is greedy -- see also [many] *) val opt : lexer -> lexer (* [opt l] scans the input using lexer [l] when possible; when [l] fails the lexer does not consume input and succeeds. *) val save : lexer -> lexer (* [save l] scans the input using lexer l and stores the region l successfully scanned into the region list. It can later be used to extract the substring lexer [l] has sucessfully scanned. The lexer fails whenever [l] fails. The resulting region list is sorted by the start points of its regions: Given a string where the following 4 regions are saved (regions can be nested): (1 (2 2) (3 (4 4) 3) 1) In the resulting region list the order of regions will be: 1 2 3 4 In the example above regions are marked by parentheses. Of couse they are still represented as (index, lengh) pairs as described above. *) val extract : region list -> string -> string list (* [extract l str] extracts all substrings from the region list [l] from string [str]. *) val scan: string -> lexer -> (int * region list) (* [scan str l] scans string [str] using lexer [l]. The result is a pair: the number of characters successfully consumed and the list of stored regions. [scan] raises [Error] when scanning fails. The regions from the region list can be passed to [extract] for extracting all regions as substrings from [str] *) val scanAndExtract : string -> lexer -> string list (* [scanAndExtract] scans using [lexer]. On successfull scan the list of saved substrings (see [save]) is returned. [Error] is raised when the string does not match *)