caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Regular expression library: a poll on features
@ 2002-07-05 14:13 Xavier Leroy
  2002-07-05 15:56 ` Pixel
  2002-07-15 15:52 ` [Caml-list] " Xavier Leroy
  0 siblings, 2 replies; 4+ messages in thread
From: Xavier Leroy @ 2002-07-05 14:13 UTC (permalink / raw)
  To: caml-list

Dear OCaml users,

There is a consensus that the current OCaml regexp library (Str) is
not satisfactory, for various reasons: not robust enough; slow on
certain regexps; non-reentrant API; and restrictive license.

We are therefore investigating alternative regexp libraries, with the
idea of providing both an emulation of the Str interface (for backward
compatibility) and a better interface (for new programs).  Several
candidates exist, most notably PCRE and Jérôme Vouillon's RE library.

One issue on which I'd like your opinion is that none of these
candidates support both features below:

1- back-references in regexps (e.g. "([a-z]+)\1", meaning any sequence
of lowercase letters followed by another occurrence of the same sequence);

2- partial string matching as per Str.string_partial_match, i.e.
the ability to recognize that a string is a prefix of a string that
match a regexp.

What I'd like to know is: do you use any of these features?  Please
reply to me personally -- there's no need to bother the list with this
-- and I'll post a summary of the replies.

Thanks for your input,

- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Regular expression library: a poll on features
  2002-07-05 14:13 [Caml-list] Regular expression library: a poll on features Xavier Leroy
@ 2002-07-05 15:56 ` Pixel
  2002-07-15 15:52 ` [Caml-list] " Xavier Leroy
  1 sibling, 0 replies; 4+ messages in thread
From: Pixel @ 2002-07-05 15:56 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list

Xavier Leroy <xavier.leroy@inria.fr> writes:

(posting on the list cuz of the wishlist below)

> 1- back-references in regexps (e.g. "([a-z]+)\1", meaning any sequence
> of lowercase letters followed by another occurrence of the same sequence);

IMO only usefull in few cases. Could be dropped with no pb.

> 
> 2- partial string matching as per Str.string_partial_match, i.e.
> the ability to recognize that a string is a prefix of a string that
> match a regexp.

quite a surprising feature. I can't imagine an interesting example of
its use.


Here is my wishlist to have regexps easily _usable_
(ie. fill the gap with perl/ruby)

- with compiler support (a la printf's format):

  flags = [ `CASELESS | `MULTILINE | `DOTALL | `EXTENDED | `ANCHORED ]
  m : ?option:flags -> regexp -> string -> string tuple option
  match_all : ?option:flags -> regexp -> string -> string tuple list
  cond_match : ?option:flags -> string -> regexp -> (string tuple -> 'a)
                                       -> regexp -> (string tuple -> 'a)
                                          ...
                                       -> 'a
           (with checking the last regexp is always true unless 'a is unit)
  subst     : string -> regexp -> (string tuple -> string) -> string
  gsubst    : string -> regexp -> (string tuple -> string) -> string
  try_subst : string -> regexp -> (string tuple -> string) -> string option

examples

    match Re.m "(\w+)=(\S+)" s with
    | None -> ...
    | Some(v, val) -> ...
    
    Re.cond_match s
      "^\s*#" (fun() -> None)
      "(\w+)=(\S+)" (fun (v, val) -> Some(v, val))
      "" (fun() -> failwith ("bad line " ^ s))
    
    Re.cond_match s "warning: (.*)" (fun s -> eprintf "a problem occured: %s\n" s)

    Re.subst (Re.subst s "^\s+" (fun() -> "")) "\s+$" (fun() -> "")


- Without compiler support

  flags = [ `CASELESS | `MULTILINE | `DOTALL | `EXTENDED | `ANCHORED ]
  re : string -> regexp
  m : ?option:flags -> regexp -> string -> string list option
  match_all : ?option:flags -> regexp -> string -> string list list
  cond_match : ?option:flags -> string -> (regexp * (string list -> 'a)) list -> 'a
  subst     : string -> regexp -> (string list -> string) -> string
  gsubst    : string -> regexp -> (string list -> string) -> string
  try_subst : string -> regexp -> (string list -> string) -> string option

examples

    match m (re"(\w+)=(\S+)") s with
    | Some [ v ; val ] -> ...
    | _ -> ...
    
    cond_match s [ 
        re"^\s*#", fun() -> None ;
        re"(\w+)=(\S+)", fun (v, val) -> Some(v, val) ;
        re"", fun() -> failwith ("bad line " ^ s) ;
    ]
    
    cond_match s "warning: (.*)" (fun s -> eprintf "a problem occured: %s\n" (hd s))

    subst (subst s (re"^\s+") (fun _ -> "")) (re"\s+$") (fun _ -> "")
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Caml-list] Re: Regular expression library: a poll on features
  2002-07-05 14:13 [Caml-list] Regular expression library: a poll on features Xavier Leroy
  2002-07-05 15:56 ` Pixel
@ 2002-07-15 15:52 ` Xavier Leroy
  2002-07-15 18:41   ` Matt Armstrong
  1 sibling, 1 reply; 4+ messages in thread
From: Xavier Leroy @ 2002-07-15 15:52 UTC (permalink / raw)
  To: caml-list

As promised, here is a summary of the replies I got concerning the
"feature poll" for OCaml's regexp library.

Feature 1: back-references in regexps (e.g. "([a-z]+)\1", meaning any sequence
of lowercase letters followed by another occurrence of the same
sequence)

     use often                  4
     use occasionally           3
     no use                     5

Feature 2: partial string matching as per Str.string_partial_match, i.e.
the ability to recognize that a string is a prefix of a string that
match a regexp.

     has already used           0
     could use in some cases    6
     no use                     8

A few more remarks on these two features, and why I asked about them
in particular.  

Feature 1 is standard in many regexp packages because it's trivial to
implement in a backtracking regexp matcher.  However, it's essentially
impossible to implement if the regexp is compiled down to a
deterministic automata.  Thus, supporting feature 1 precludes
DFA-based implementations such as Jérôme Vouillon's RE library.

Feature 2 is unusual and I haven't heard from anyone that uses it :-)
I got two replies suggesting one plausible scenario where partial
matching could come handy: find delimiters in a piece of text that
is being read block by block.  However, I'm not sure
Str.string_partial_match is adequate here, it looks like a
"search forward for a partial match" operation is needed, which Str
doesn't provide...

It was also suggested to me that the effect of partial matching
against a regexp R can be achieved by exact matching against a regexp
R' derived from R.  This is true for "textbook regexps", e.g. if R is
"ab*c", then R' would be "epsilon|a(epsilon|b*(epsilon|c))",
but doesn't work for more complex regexps languages, especially if
back-references are supported.  (Consider R = "(a+)\1".)

Thanks for your input,

- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Caml-list] Re: Regular expression library: a poll on features
  2002-07-15 15:52 ` [Caml-list] " Xavier Leroy
@ 2002-07-15 18:41   ` Matt Armstrong
  0 siblings, 0 replies; 4+ messages in thread
From: Matt Armstrong @ 2002-07-15 18:41 UTC (permalink / raw)
  To: caml-list

Xavier Leroy <xavier.leroy@inria.fr> writes:

[...]

> Feature 2: partial string matching as per Str.string_partial_match, i.e.
> the ability to recognize that a string is a prefix of a string that
> match a regexp.
>
>      has already used           0
>      could use in some cases    6
>      no use                     8

[...]

> Feature 2 is unusual and I haven't heard from anyone that uses it
> :-) I got two replies suggesting one plausible scenario where
> partial matching could come handy: find delimiters in a piece of
> text that is being read block by block.  However, I'm not sure
> Str.string_partial_match is adequate here, it looks like a "search
> forward for a partial match" operation is needed, which Str doesn't
> provide...

This is how a MIME message parser I wrote worked (written in a
scripting language that made byte-by-byte string comparisons more
costly than regexps).  The parser read in the message chunk by chunk.
I had a list of regexps representing the current set of MIME
boundaries, and I was interested if the last N bytes of the current
chunk ended with a (possibly partial) match of each regexp.  If there
was a match and it wasn't complete, you have to deal with a MIME
boundary that might cross a chunk boundary.


> It was also suggested to me that the effect of partial matching
> against a regexp R can be achieved by exact matching against a
> regexp R' derived from R.  This is true for "textbook regexps",
> e.g. if R is "ab*c", then R' would be
> "epsilon|a(epsilon|b*(epsilon|c))", but doesn't work for more
> complex regexps languages, especially if back-references are
> supported.  (Consider R = "(a+)\1".)

And in the MIME parser, this is what I did -- since the regexps were
simple.

In Ocaml, I'm not sure I would use regexps for this at all since (I
assume) comparing strings "by hand" would be fast.
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2002-07-15 18:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-05 14:13 [Caml-list] Regular expression library: a poll on features Xavier Leroy
2002-07-05 15:56 ` Pixel
2002-07-15 15:52 ` [Caml-list] " Xavier Leroy
2002-07-15 18:41   ` Matt Armstrong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).