caml-list - the Caml user's mailing list
* [Caml-list] Some sugar for regexp matching using camlp4
@ 2001-07-16 15:54 Francois Pottier
  2001-07-16 17:37 ` Alexander V. Voinov
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Francois Pottier @ 2001-07-16 15:54 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1613 bytes --]


Hello all,

I have experimented a bit with custom syntax for regular expression
matching. My goal was to implement some high-level constructs on top
of a low-level regexp library such as PCRE. The result of my (modest)
experiment is attached. It is a camlp4 grammar extension, which allows
writing

  extract x, y, ... matching e against r in e'

The semantics is as follows. The expression e is evaluated, yielding
a string which is matched against the regular expression r. r must be
either a constant string, or a compiled regular expression; if the
former, pre-compilation code is inserted transparently. The variables
x, y, ... are then bound to the appropriate groups (i.e. x is
bound to the sub-string which matched the whole pattern, y is bound
to the sub-string which matched the first group, etc.) and can be
referred to within e'. Wildcards _ can be used instead of variables.
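
To make the semantics concrete, here is a hand-written sketch of roughly
what the extension generates for one use of the construct. The pattern, the
function name parse_version and the variable names major and minor are made
up for illustration; the names _regexp_1 and _substrings follow the attached
code. It assumes the pcre-ocaml library (module Pcre) is installed, and it is
an approximation, not the preprocessor's literal output.

```ocaml
(* Hand-written approximation of the code generated for

     extract _, major, minor matching s against "(\\d+)\\.(\\d+)" in
     (int_of_string major, int_of_string minor)

   Requires the pcre-ocaml library. [parse_version] is a hypothetical
   example function, not part of the extension. *)

(* The constant pattern is compiled once, by a global declaration. *)
let _regexp_1 = Pcre.regexp "(\\d+)\\.(\\d+)"

let parse_version (s : string) : int * int =
  (* Pcre.exec raises Not_found if [s] does not match. *)
  let _substrings = Pcre.exec ~rex:_regexp_1 s in
  (* Three patterns were given: the whole match plus two groups. *)
  assert (Pcre.num_of_subs _substrings = 3);
  (* Bindings come out innermost-first; the wildcard produces no binding. *)
  let minor = Pcre.get_substring _substrings 2 in
  let major = Pcre.get_substring _substrings 1 in
  (int_of_string major, int_of_string minor)
```

A string that does not match makes Pcre.exec raise Not_found, so callers
that expect failures would wrap the call in a try ... with Not_found handler.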

This is of course pretty modest, but it seems that, with a small
number of such constructs, O'Caml could be turned into a rather nice
textual manipulation language. (Something often requested on this
list.) Opinions and further suggestions are welcome.

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/

Here's how to use the syntax extension:

1. Compile it:

  ocamlc -pp "camlp4o -I `camlp4o -where`" -I `camlp4o -where` -c pcreg.ml

2. At the beginning of your source files, insert

  #load "pcreg.cmo";;

3. Compile your source files using the following option:

  -pp "camlp4o -I ."

   (in addition to any options necessary to include the PCRE library,
    e.g. -I +contrib).


[-- Attachment #2: pcreg.ml --]
[-- Type: text/plain, Size: 5147 bytes --]

(* $Header: /net/pauillac/caml/repository/bigbro/pcreg.ml,v 1.1 2001/07/16 15:04:04 fpottier Exp $ *)

open Pcaml

#load "pa_extend.cmo";;
#load "q_MLast.cmo";;

(* ----------------------------------------------------------------------------------------------------------------- *)
(* We begin with an internal utility: a global variable generator, which can be called within grammar rules.

   The global variables receive names numbered in a linear fashion. There is a possibility of name clashes
   if another module, which uses the same name generator, is ``opened'' and that module does not have a
   [.mli] file. It is recommended to always use [.mli] files to describe module interfaces, so these
   internal variable names will not be exported. *)

(* This global variable is used to accumulate global variable declarations while the parser is running. *)

let globals =
  ref []

(* This function allows registering a new global declaration. It can be called within a grammar rule. *)

let declare (item : MLast.str_item) =
  globals := (item, (0, 0) (* dummy location *)) :: !globals

(* This function is used to generate a fresh identifier. *)

let generate =
  let count = ref 0 in
  fun () ->
    incr count;
    Printf.sprintf "_regexp_%d" !count

(* This hook, which is called once per implementation file, adds the global declarations generated by calls
   to [declare] at the beginning of the module. *)

let _ = EXTEND
  implem: FIRST
    [[ (sil, stopped) = NEXT ->
       let extra = !globals in
       globals := [];
       (extra @ sil, stopped)
     ]];
END

(* ----------------------------------------------------------------------------------------------------------------- *)
(* This auxiliary function allows generating code for assertions.

   [assert] is dealt with as a kind of special-purpose syntax extension in O'Caml. However, code in quotations must
   be expressed in plain (righteous) syntax, which means that it cannot use [assert] directly. Hence, we must use
   this code (taken from [camlp4]'s [pa_o.ml]) to generate assertions.

   Note that the generated code depends on the value of [camlp4]'s [-noassert] option. This option is distinct
   from [ocaml]'s own [-noassert] option. *)

let make_assert loc e =
  let f = <:expr< $str:!Pcaml.input_file$ >> in
  let bp = <:expr< $int:string_of_int (fst loc)$ >> in
  let ep = <:expr< $int:string_of_int (snd loc)$ >> in
  let raiser = <:expr< raise (Assert_failure ($f$, $bp$, $ep$)) >> in
  if !Pcaml.no_assert
  then <:expr< () >>
  else <:expr< if $e$ then () else $raiser$ >>

(* ----------------------------------------------------------------------------------------------------------------- *)
(* We continue with syntactic extensions which allow dealing with regular expressions easily.

   The syntax

     extract s0, s1, ..., sk matching e against r in e'

   evaluates the expression [e], matches its value against the regular expression [r] using [Pcre.exec], and binds the
   substrings thus obtained to the patterns [s0], [s1], ..., [sk]. (Each [si] must be either a variable or the
   wildcard pattern [_].) [Pcre.exec] raises [Not_found] if it doesn't match. The code also contains a dynamic check
   (using [assert]) which ensures that the number of extracted substrings, namely $k+1$, is consistent with the
   supplied regular expression. Lastly, the expression [r] must be either a string constant, or a compiled regular
   expression. If the former, the string is pre-compiled (using a global declaration) into a regular expression. *)

let _ = EXTEND
  GLOBAL: expr;
  expr: LEVEL "expr1"
    [[ (p, e, r, l) = [ "extract"; p = LIST1 simplepat SEP ","; "matching"; e = expr; "against"; r = expr ->
                        (p, e, r, loc) ]; (* anonymous sub-rule allows extracting partial location [l] *)
       "in"; body = expr LEVEL "top" ->

	 (* If the regular expression is a string constant, generate pre-compilation code for it. *)

	 let r = match r with
	 | <:expr< $str:s$ >> ->
	     let name = generate() in
	     declare <:str_item< value $lid:name$ = Pcre.regexp $str:s$ >>;
	     <:expr< $lid:name$ >>
	 | _ ->
	     r in

	 (* Wrap bindings for the substrings around the declaration's body. *)

	 let body, _ = List.fold_left (fun (body, index) name ->
	   begin
	     match name with
	     | Some name ->
		 <:expr<
	           let $lid:name$ = Pcre.get_substring _substrings $int:(string_of_int index)$ in
		   $body$
	         >>
	     | None ->
		 body
	   end, index + 1
	 ) (body, 0) p in

	 (* Wrap a dynamic check around the code thus obtained, to ensure that the number of substrings
	    extracted out of the pattern is correct. *)

	 let condition = <:expr< Pcre.num_of_subs _substrings = $int:(string_of_int (List.length p))$ >> in
	 let assertion = make_assert l condition in

	 let body = <:expr< 
	   do {
	     $assertion$;
	     $body$
	   }
	 >> in

	 (* Wrap the actual pattern matching instruction around the code thus obtained. *)

	 <:expr<
	   let _substrings = Pcre.exec ~rex:$r$ $e$ in
	   $body$
	 >>

    ]]
  ;
  simplepat:
    [[ x = LIDENT -> Some x
     | "_"        -> None ]]
  ;
END
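
The fold that wraps the substring bindings around the body can be modelled
in plain OCaml, without camlp4, by treating expressions as strings. This is
only a toy sketch to show the nesting order; wrap_bindings is a hypothetical
name and the string output stands in for the quotations used above.

```ocaml
(* Toy model of the binding-wrapping fold: expressions are plain strings,
   and each pattern is [Some name] or [None], like the results of the
   [simplepat] rule. [wrap_bindings] is a hypothetical helper. *)
let wrap_bindings (names : string option list) (body : string) : string =
  let wrapped, _ =
    List.fold_left
      (fun (body, index) name ->
        let body =
          match name with
          | Some x ->
              (* Wrap one more let-binding around what was built so far. *)
              Printf.sprintf
                "let %s = Pcre.get_substring _substrings %d in %s" x index body
          | None ->
              (* A wildcard adds no binding but still consumes an index. *)
              body
        in
        (body, index + 1))
      (body, 0) names
  in
  wrapped

let () =
  print_endline (wrap_bindings [ Some "whole"; None; Some "minor" ] "f whole minor")
```

Because the fold runs left to right and wraps each time, the first pattern
ends up innermost and the last one outermost. The order is harmless here,
since the bindings do not depend on one another.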



* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-16 15:54 [Caml-list] Some sugar for regexp matching using camlp4 Francois Pottier
@ 2001-07-16 17:37 ` Alexander V. Voinov
  2001-07-17  2:36   ` Brian Rogoff
  2001-07-17 10:36 ` Markus Mottl
  2001-07-17 11:45 ` Michel Schinz
  2 siblings, 1 reply; 10+ messages in thread
From: Alexander V. Voinov @ 2001-07-16 17:37 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

Hi Francois,

Francois Pottier wrote:

>   extract x, y, ... matching e against r in e'
> This is of course pretty modest, but it seems that, with a small
> number of such constructs, O'Caml could be turned into a rather nice
> textual manipulation language. (Something often requested on this
> list.) Opinions and further suggestions are welcome.

It would be great. A first question about the announcement itself (I haven't
played with the extension yet): what does it do when the match fails? Does it
raise an exception?

Alexander


-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr



* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-16 17:37 ` Alexander V. Voinov
@ 2001-07-17  2:36   ` Brian Rogoff
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Rogoff @ 2001-07-17  2:36 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

Francois Pottier wrote:
 
> This is of course pretty modest, but it seems that, with a small
> number of such constructs, O'Caml could be turned into a rather nice
> textual manipulation language. (Something often requested on this
> list.) Opinions and further suggestions are welcome.

I'd be interested in seeing a sugar'ed version of SNOBOL4 or SPITBOL for 
string processing embedded in OCaml. 

-- Brian



* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-16 15:54 [Caml-list] Some sugar for regexp matching using camlp4 Francois Pottier
  2001-07-16 17:37 ` Alexander V. Voinov
@ 2001-07-17 10:36 ` Markus Mottl
  2001-07-17 12:15   ` Francois Pottier
  2001-07-17 11:45 ` Michel Schinz
  2 siblings, 1 reply; 10+ messages in thread
From: Markus Mottl @ 2001-07-17 10:36 UTC (permalink / raw)
  To: Francois Pottier; +Cc: caml-list

On Mon, 16 Jul 2001, Francois Pottier wrote:
> I have experimented a bit with custom syntax for regular expression
> matching. My goal was to implement some high-level constructs on top
> of a low-level regexp library such as PCRE. The result of my (modest)
> experiment is attached. It is a camlp4 grammar extension, which allows
> writing

Nice! This example could surely be used to build a convenient
special-purpose language for text manipulation.

>   extract x, y, ... matching e against r in e'
> 
> The semantics is as follows. The expression e is evaluated, yielding
> a string which is matched against the regular expression r. r must be
> either a constant string, or a compiled regular expression; if the
> former, pre-compilation code is inserted transparently.

Note that it should be possible to assert the required number of subgroups
at compile-time if the user supplied a constant string: you'd only have
to compile the pattern string to a regexp within the camlp4-rule and
check things there. This would even allow you to catch illegal patterns:
static typing for regular expressions :-)
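
A minimal sketch of such a check, written as an ordinary function that a
grammar rule could call on a constant pattern. It assumes the pcre-ocaml
library is linked into the preprocessor; check_arity and its error messages
are hypothetical and not part of the posted extension.

```ocaml
(* Sketch of a check that a preprocessor could run on a constant pattern.
   Requires the pcre-ocaml library. [check_arity] is a hypothetical helper:
   [nvars] is the number of patterns written after "extract". *)
let check_arity (pattern : string) (nvars : int) : (unit, string) result =
  match Pcre.regexp pattern with
  | exception Pcre.Error (Pcre.BadPattern (msg, pos)) ->
      (* The pattern itself is malformed. *)
      Error (Printf.sprintf "illegal pattern at offset %d: %s" pos msg)
  | rex ->
      (* The whole match plus one substring per capturing group. *)
      let expected = Pcre.capturecount rex + 1 in
      if expected = nvars then Ok ()
      else
        Error
          (Printf.sprintf
             "pattern yields %d substrings, but %d variables were given"
             expected nvars)
```

For instance, check_arity "(a)(b)" 3 would succeed, while passing 2 or an
unbalanced pattern such as "(a" would report an error before the program
is ever run.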

Best regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-16 15:54 [Caml-list] Some sugar for regexp matching using camlp4 Francois Pottier
  2001-07-16 17:37 ` Alexander V. Voinov
  2001-07-17 10:36 ` Markus Mottl
@ 2001-07-17 11:45 ` Michel Schinz
  2001-07-17 12:18   ` Francois Pottier
  2 siblings, 1 reply; 10+ messages in thread
From: Michel Schinz @ 2001-07-17 11:45 UTC (permalink / raw)
  To: caml-list

Francois Pottier <Francois.Pottier@inria.fr> writes:

> Hello all,

[...]

> This is of course pretty modest, but it seems that, with a small
> number of such constructs, O'Caml could be turned into a rather nice
> textual manipulation language. (Something often requested on this
> list.) Opinions and further suggestions are welcome.

You might want to look at scsh[1] (my standard suggestion for this
list, it seems). The construct you implemented also exists in scsh,
under the name "let-match" (see page 134 of the scsh manual [2]). Many
other constructs are supported, like "if-match" (similar to let-match
but with a clause to be evaluated when the regular expression does not
match), "match-cond" (tries several regular expressions until one
matches) and so on.
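
For comparison, the behaviour of these two constructs can be approximated
in OCaml with ordinary functions, before adding any syntax. This is a rough
sketch assuming the pcre-ocaml library; if_match and match_cond here are
hypothetical OCaml functions named after the scsh forms, not a rendering of
how scsh implements them.

```ocaml
(* Function-level approximations of two scsh-style constructs.
   Requires the pcre-ocaml library. Both names are hypothetical. *)

(* Run [matched] on the substrings (index 0 is the whole match) if [s]
   matches [rex], and [otherwise] if it does not. *)
let if_match rex s ~matched ~otherwise =
  match Pcre.exec ~rex s with
  | subs -> matched (Pcre.get_substrings subs)
  | exception Not_found -> otherwise ()

(* Try each (regexp, handler) pair in order and return the result of the
   first handler whose regexp matches, or None if none does. *)
let rec match_cond s = function
  | [] -> None
  | (rex, handler) :: rest -> (
      match Pcre.exec ~rex s with
      | subs -> Some (handler (Pcre.get_substrings subs))
      | exception Not_found -> match_cond s rest)
```

Only the Not_found raised by Pcre.exec is caught, so an exception raised
inside a handler still propagates to the caller.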

Also very interesting in scsh is the sexp-based notation for regular
expressions (pages 112-... of [2]).

[1] http://www.swiss.ai.mit.edu/ftpdir/scsh/
    and http://sourceforge.net/projects/scsh/
[2] ftp://ftp-swiss.ai.mit.edu/pub/su/scsh/scsh-manual.ps

Michel.

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-17 10:36 ` Markus Mottl
@ 2001-07-17 12:15   ` Francois Pottier
  2001-07-17 12:39     ` Markus Mottl
  0 siblings, 1 reply; 10+ messages in thread
From: Francois Pottier @ 2001-07-17 12:15 UTC (permalink / raw)
  To: Markus Mottl; +Cc: caml-list


> Note that it should be possible to assert the required number of subgroups
> at compile-time if the user supplied a constant string: you'd only have
> to compile the pattern string to a regexp within the camlp4-rule and
> check things there.

Sounds good, except this would require building a custom version of camlp4,
because it can't dynamically load the Pcre library (as far as I can tell).

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-17 11:45 ` Michel Schinz
@ 2001-07-17 12:18   ` Francois Pottier
  0 siblings, 0 replies; 10+ messages in thread
From: Francois Pottier @ 2001-07-17 12:18 UTC (permalink / raw)
  To: Michel Schinz; +Cc: caml-list


Michel,

> You might want to look at scsh[1] (my standard suggestion for this
> list, it seems).

Thanks for the pointer -- it looks very interesting indeed.

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-17 12:15   ` Francois Pottier
@ 2001-07-17 12:39     ` Markus Mottl
  2001-07-17 12:44       ` Daniel de Rauglaudre
  0 siblings, 1 reply; 10+ messages in thread
From: Markus Mottl @ 2001-07-17 12:39 UTC (permalink / raw)
  To: Francois Pottier; +Cc: caml-list

On Tue, 17 Jul 2001, Francois Pottier wrote:
> > Note that it should be possible to assert the required number of subgroups
> > at compile-time if the user supplied a constant string: you'd only have
> > to compile the pattern string to a regexp within the camlp4-rule and
> > check things there.
> 
> Sounds good, except this would require building a custom version of camlp4,
> because it can't dynamically load the Pcre library (as far as I can tell).

I am not a camlp4-guru, but if I am not mistaken, such extensions should
be quite straightforward. If users have to preprocess their files anyway,
they are probably indifferent to whether their preprocessor is the
"plain vanilla" one or not.

Maybe Daniel could tell us how to implement this extension with least
effort?

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-17 12:39     ` Markus Mottl
@ 2001-07-17 12:44       ` Daniel de Rauglaudre
  2001-07-17 12:52         ` Markus Mottl
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel de Rauglaudre @ 2001-07-17 12:44 UTC (permalink / raw)
  To: caml-list

Hi,

On Tue, Jul 17, 2001 at 02:39:46PM +0200, Markus Mottl wrote:

> I am not a camlp4-guru, but if I am not mistaken, such extensions should
> be quite straightforward. If users have to preprocess their files anyway,
> they are probably indifferent to whether their preprocessor is the
> "plain vanilla" one or not.
> 
> Maybe Daniel could tell us how to implement this extension with least
> effort?

       man mkcamlp4

:-)

-- 
Daniel de RAUGLAUDRE
daniel.de_rauglaudre@inria.fr
http://cristal.inria.fr/~ddr/

* Re: [Caml-list] Some sugar for regexp matching using camlp4
  2001-07-17 12:44       ` Daniel de Rauglaudre
@ 2001-07-17 12:52         ` Markus Mottl
  0 siblings, 0 replies; 10+ messages in thread
From: Markus Mottl @ 2001-07-17 12:52 UTC (permalink / raw)
  To: Daniel de Rauglaudre; +Cc: caml-list

On Tue, 17 Jul 2001, Daniel de Rauglaudre wrote:
> > Maybe Daniel could tell us how to implement this extension with least
> > effort?
> 
>        man mkcamlp4

Ah, yes, that was quick :-)

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus