caml-list - the Caml user's mailing list
* [Caml-list] [ann] Regexp library supporting binding for * and +'s
@ 2004-09-19 20:41 Yutaka OIWA
  2004-09-20  0:38 ` skaller
  0 siblings, 1 reply; 4+ messages in thread
From: Yutaka OIWA @ 2004-09-19 20:41 UTC (permalink / raw)
  To: caml-list

Hi everyone at caml-list,

From the computer room at ICFP2004 in Snowbird Resort,
I announce a beta version of my combinator-based
regular-expression match library which supports
list (Kleene-*) binding.

This library provides a set of typed "combinators" for
constructing "regular expression matchers", which test strings
against regexps and capture the matched substrings in various ways.
In particular, the powerful "repeat" combinator, which corresponds
to the * and + operators in conventional regular expression notation,
returns all values captured inside it as a list.

For example, the small code below

  open Regexp_pp_ng
  let s = "1 2 3 4 5" in
  match_string s (repeat ~sep:spacesA int_decimal) (fun x -> x)

returns [1; 2; 3; 4; 5]: int list.
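
To make the idea concrete without the library at hand, here is a minimal
self-contained sketch of a list-binding "repeat" combinator. It is not the
code of Regexp_pp_ng: the matcher type, the name spaces (standing in for
spacesA) and the option-returning match_string are assumptions made for
illustration, and no regexp backend is involved.

```ocaml
(* A matcher reads the input from position [pos] and, on success,
   returns the captured value together with the next position. *)
type 'a matcher = string -> int -> ('a * int) option

(* One or more decimal digits, captured as an int. *)
let int_decimal : int matcher = fun s pos ->
  let n = String.length s in
  let rec scan i acc =
    if i < n && s.[i] >= '0' && s.[i] <= '9'
    then scan (i + 1) (acc * 10 + Char.code s.[i] - Char.code '0')
    else (i, acc)
  in
  let (stop, value) = scan pos 0 in
  if stop = pos then None else Some (value, stop)

(* One or more spaces, capturing nothing. *)
let spaces : unit matcher = fun s pos ->
  let n = String.length s in
  let rec scan i = if i < n && s.[i] = ' ' then scan (i + 1) else i in
  let stop = scan pos in
  if stop = pos then None else Some ((), stop)

(* Zero or more [item]s separated by [sep]; every capture made by
   [item] ends up in the resulting list, in input order. *)
let repeat ~(sep : unit matcher) (item : 'a matcher) : 'a list matcher =
  fun s pos ->
    let rec more acc pos =
      match sep s pos with
      | None -> (List.rev acc, pos)
      | Some ((), p) ->
        (match item s p with
         | None -> (List.rev acc, pos)  (* trailing separator is not consumed *)
         | Some (v, p') -> more (v :: acc) p')
    in
    match item s pos with
    | None -> Some ([], pos)
    | Some (v, p) -> Some (more [v] p)

(* Succeed only if the matcher consumes the whole string. *)
let match_string s (m : 'a matcher) (k : 'a -> 'b) : 'b option =
  match m s 0 with
  | Some (v, stop) when stop = String.length s -> Some (k v)
  | _ -> None

let result =
  match_string "1 2 3 4 5" (repeat ~sep:spaces int_decimal) (fun x -> x)
```

In this sketch result has type int list option, whereas the library call
above returns the list directly; how the real library reports a failed
match is not stated in this message.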

All combinators are given static types, and any mismatch between
value types and matcher types is rejected statically.
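
As a rough illustration of what "statically rejected" means here, the
sketch below uses a deliberately simplified matcher type of my own (a
function from the whole string to an optional value); pair and this
match_string are hypothetical stand-ins, not the library's definitions.
The point is only that the continuation's argument type is tied to the
matcher's type parameter.

```ocaml
(* Simplified for illustration: a matcher maps a whole string to a value. *)
type 'a matcher = string -> 'a option

let is_digit c = c >= '0' && c <= '9'

(* A non-empty string of decimal digits, as an int. *)
let int_decimal : int matcher = fun s ->
  let ok = ref (s <> "") in
  String.iter (fun c -> if not (is_digit c) then ok := false) s;
  if !ok then int_of_string_opt s else None

(* Two matchers separated by the first occurrence of the character [sep]. *)
let pair ~sep (a : 'a matcher) (b : 'b matcher) : ('a * 'b) matcher = fun s ->
  match String.index_opt s sep with
  | None -> None
  | Some i ->
    let left = String.sub s 0 i in
    let right = String.sub s (i + 1) (String.length s - i - 1) in
    (match a left, b right with
     | Some x, Some y -> Some (x, y)
     | _ -> None)

(* The continuation must accept exactly the type the matcher produces. *)
let match_string s (m : 'a matcher) (k : 'a -> 'b) : 'b option =
  match m s with
  | Some v -> Some (k v)
  | None -> None

(* Accepted: the matcher yields int * int and the continuation takes one. *)
let sum =
  match_string "3,4" (pair ~sep:',' int_decimal int_decimal)
    (fun (a, b) -> a + b)

(* Rejected by the type checker if uncommented: the matcher yields
   int * int but String.length expects a string.
   let bad =
     match_string "3,4" (pair ~sep:',' int_decimal int_decimal) String.length *)
```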

The implementation is available from a Subversion repository.
Using Subversion, you can check out the URL

 https://www.oiwa.jp/svn/regexp-ocaml/branches/combinators/

to get the up-to-date implementation, or you can open the
above address directly in a web browser to see the latest revision.
There is also a ViewCVS interface at the following address.

 http://www.oiwa.jp/viewcvs/regexp-ocaml/branches/combinators/

See regexp_pp_ng.mli for the interface, and regexp_pp_ng_test.ml for
some examples of how to use this library.

It may work partially on some older OCaml versions, but for real use
it requires a newer version (3.07 or later) which supports
the relaxed value restriction.
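
For readers unfamiliar with the relaxed value restriction, the following
sketch shows the kind of definition it makes polymorphic. The matcher
type and the names fail and many are assumptions for illustration, and
that this is precisely why the library needs 3.07 is my guess rather
than something stated above.

```ocaml
type 'a matcher = string -> int -> ('a * int) option

(* A matcher that never succeeds; polymorphic because it is a function. *)
let fail : 'a matcher = fun _ _ -> None

(* Zero or more repetitions, collecting the captured values. *)
let many (m : 'a matcher) : 'a list matcher = fun s pos ->
  let rec go acc pos =
    match m s pos with
    | Some (v, p) when p > pos -> go (v :: acc) p   (* require progress *)
    | _ -> Some (List.rev acc, pos)
  in
  go [] pos

(* [many fail] is an application, not a syntactic value. Under the
   relaxed value restriction its type variable is still generalized,
   because it occurs only in covariant positions of the matcher type. *)
let nothing = many fail

(* The same matcher can therefore be used at two different types. *)
let as_ints : int list matcher = nothing
let as_strings : string list matcher = nothing
```

Without the relaxation, nothing would get a weak (non-generalized) type
variable, and the second use at a different type would be a type error.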

I plan to build some neat syntactic sugar on top of this library
and turn it into a next-generation version of the Regexp/OCaml library.
Any comments are welcome.

-- 
Yutaka Oiwa              Yonezawa Lab., Dept. of Computer Science,
      Graduate School of Information Sci. & Tech., Univ. of Tokyo.
                    <oiwa@yl.is.s.u-tokyo.ac.jp>, <yutaka@oiwa.jp>
PGP fingerprint = C9 8D 5C B8 86 ED D8 07  EA 59 34 D8 F4 65 53 61

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] [ann] Regexp library supporting binding for * and +'s
  2004-09-19 20:41 [Caml-list] [ann] Regexp library supporting binding for * and +'s Yutaka OIWA
@ 2004-09-20  0:38 ` skaller
  2004-09-20  6:54   ` Yutaka OIWA
  0 siblings, 1 reply; 4+ messages in thread
From: skaller @ 2004-09-20  0:38 UTC (permalink / raw)
  To: Yutaka OIWA; +Cc: caml-list

On Mon, 2004-09-20 at 06:41, Yutaka OIWA wrote:

> From the computer room at ICFP2004 in Snowbird Resort,
> I announce a beta version of my combinator-based
> regular-expression match library which supports
> list (Kleene-*) binding.

> I plan to construct a neat syntax sugar over this library 
> and build a next-generation version of Regexp/OCaml library.
> Any comments are welcome.

Can you explain why/how Pcre is being used?

I'm currently looking at providing the same kind
of facility, however I need: 

(a) all pure OCaml -- reason: maintenance, soundness

(b) able to generate fairly simple automata
Reason-- the execution target may be C,
so it must be possible to both encode the data
fairly simply, and also to provide C routines
to execute various automata based on that data,
without building complex data structures.

(c) must process at least a stream of integer inputs
Reason: 8 bit inputs are unacceptable for i18n reasons.
In addition, there are uses of state machines other
than processing 'strings'.
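
Requirements (b) and (c) can be pictured with a small table-driven
automaton over integer inputs. This is only a sketch under my own
assumptions (the record layout, the range-based input classes and the
name run are all hypothetical); it is not taken from any existing
engine. Everything except the driver loop is plain arrays of integers
and booleans, which is the kind of data that is easy to emit as C
initializers.

```ocaml
(* All automaton data is flat arrays, so it could be dumped as C tables. *)
type dfa = {
  n_classes : int;                   (* number of input classes (columns) *)
  ranges : (int * int * int) array;  (* lo, hi, class; anything else is class 0 *)
  table : int array;                 (* table.(state * n_classes + cls); -1 = no move *)
  accepting : bool array;            (* indexed by state *)
  start : int;
}

(* Map an input integer (e.g. a code point) to its input class. *)
let classify (d : dfa) (x : int) : int =
  let cls = ref 0 in
  Array.iter (fun (lo, hi, c) -> if x >= lo && x <= hi then cls := c) d.ranges;
  !cls

(* Run the automaton over a sequence of integers. *)
let run (d : dfa) (input : int list) : bool =
  let rec go state = function
    | [] -> d.accepting.(state)
    | x :: rest ->
      let next = d.table.(state * d.n_classes + classify d x) in
      if next < 0 then false else go next rest
  in
  go d.start input

(* Example: one or more ASCII digits. Class 1 = digit, class 0 = other. *)
let digits = {
  n_classes = 2;
  ranges = [| (0x30, 0x39, 1) |];
  table = [| -1; 1;     (* state 0: other -> none, digit -> 1 *)
             -1; 1 |];  (* state 1: other -> none, digit -> 1 *)
  accepting = [| false; true |];
  start = 0;
}

let ok = run digits [0x31; 0x32; 0x33]   (* "123" as code points *)
let bad = run digits [0x31; 0x3042]      (* a digit followed by a non-digit *)
```

Because inputs are classified by range rather than used as direct table
indices, the table stays small even when the input alphabet is all of
Unicode.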

I'd like to combine at least (i) tokenisation
and (ii) substring extraction; however, a more general
facility such as parsing as in C/XDuce is also appealing.
Alternatively, or as well, processing tagged NFAs
readily yields RTNs and hence CFG parsing support.

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net




* Re: [Caml-list] [ann] Regexp library supporting binding for * and +'s
  2004-09-20  0:38 ` skaller
@ 2004-09-20  6:54   ` Yutaka OIWA
  2004-09-20 11:12     ` skaller
  0 siblings, 1 reply; 4+ messages in thread
From: Yutaka OIWA @ 2004-09-20  6:54 UTC (permalink / raw)
  To: skaller; +Cc: caml-list

>> On 20 Sep 2004 10:38:33 +1000, skaller <skaller@users.sourceforge.net> said:

skaller> On Mon, 2004-09-20 at 06:41, Yutaka OIWA wrote:
>> I plan to construct a neat syntax sugar over this library 
>> and build a next-generation version of Regexp/OCaml library.
>> Any comments are welcome.

skaller> Can you explain why/how Pcre is being used?

The reason is simply convenience for the current implementation.
PCRE is stable, has enough features (e.g. an unlimited number of captures,
non-capturing groups, many helper functions and runtime features),
and performs well. My intention is not to implement an automata engine
by myself, at least in the near future.

However, as you can see in the README of Regexp-OCaml (main version), my
future plans include supporting backends other than PCRE/OCaml.
Having its own regexp parser and limiting the regexp syntax to strictly
regular languages are provisions for that possible future.
When OCaml 3.07 was released, I seriously considered supporting
the standard Str module, but unfortunately the current Str lacks some of
the features required by the current Regexp/OCaml implementation.
Anyway, backend is backend. And also, frontend is frontend. Period.
The two can be highly independent once designed that way, and my interests
are mainly in the frontend part. I would highly appreciate support from
people working on the backend part.

Multilingualization is one of the high-priority items on my current to-do list.
At least one user has asked me to support EUC-JP patterns,
and you might be the second person :-)
I am considering how to support M17N features: it may depend on the
underlying backend (e.g. Camomile?), or it may be supported solely in the
frontend layer, by encoding multibyte handling into regexps.
This trick is used in the Japanese port of the Perl interpreter for MS-DOS,
and in (at least) one of the Japanese-handling modules for Perl5.
# As you can imagine, just using the M17N feature of the underlying library is
# not sufficient: the internal regexp parser must also be modified to accept
# multibyte-encoded regular expressions. This is one of the reasons that the
# current Regexp/OCaml does not support the UTF8 option of PCRE/OCaml.
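
To illustrate what "encoding multibyte handling into regexps" amounts
to, here is a hand-written recognizer for the byte-level alternation
that stands for one EUC-JP character. It is a sketch of the idea only:
the function names are hypothetical, no regexp engine is used, and it
does not show how Regexp/OCaml or the Perl modules mentioned above
actually rewrite patterns.

```ocaml
(* In a byte-oriented engine, "one EUC-JP character" can be written as
   the alternation (PCRE-style syntax):
     [\x00-\x7f]                    ASCII
   | [\xa1-\xfe][\xa1-\xfe]         JIS X 0208
   | \x8e[\xa1-\xdf]                half-width katakana
   | \x8f[\xa1-\xfe][\xa1-\xfe]     JIS X 0212
   The function below recognizes the same alternation by hand. *)

let in_range lo hi c =
  let b = Char.code c in
  b >= lo && b <= hi

(* Length in bytes of the EUC-JP character starting at [i], if well formed. *)
let euc_jp_char_len (s : string) (i : int) : int option =
  let n = String.length s in
  let at k lo hi = i + k < n && in_range lo hi s.[i + k] in
  if i >= n then None
  else if at 0 0x00 0x7f then Some 1
  else if at 0 0xa1 0xfe && at 1 0xa1 0xfe then Some 2
  else if at 0 0x8e 0x8e && at 1 0xa1 0xdf then Some 2
  else if at 0 0x8f 0x8f && at 1 0xa1 0xfe && at 2 0xa1 0xfe then Some 3
  else None

(* Split a byte string into characters; None on malformed input. *)
let split_chars (s : string) : string list option =
  let rec go i acc =
    if i = String.length s then Some (List.rev acc)
    else
      match euc_jp_char_len s i with
      | None -> None
      | Some len -> go (i + len) (String.sub s i len :: acc)
  in
  go 0 []

(* "a", then the two-byte hiragana A (0xA4 0xA2), then "b". *)
let chars = split_chars "a\xa4\xa2b"
```

A frontend using this trick would substitute such an alternation
wherever the user's pattern means "any character", so that the backend
never splits a multibyte character in the middle.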

For supporting list-binding of Kleene stars, I am very interested in
richer backends which support such features.  Alain Frisch's recent
posting has interested me.  There is also a talk with a related title at
ICFP04, although I have not yet read the paper.
However, I also feel that the backend is not currently a show-stopper:
it would certainly be better to have such backends, but list-binding can be
emulated without them, as I have shown in the combinators.  I can wait for a
while for theoretical/practical progress. The current problem is mainly the
frontend: there are many language-design problems once we introduce nested
bindings. I have already had discussions with some people at ICFP04, and I
hope to have more.

-- 
Yutaka Oiwa              Yonezawa Lab., Dept. of Computer Science,
      Graduate School of Information Sci. & Tech., Univ. of Tokyo.
                    <oiwa@yl.is.s.u-tokyo.ac.jp>, <yutaka@oiwa.jp>
PGP fingerprint = C9 8D 5C B8 86 ED D8 07  EA 59 34 D8 F4 65 53 61


* Re: [Caml-list] [ann] Regexp library supporting binding for * and +'s
  2004-09-20  6:54   ` Yutaka OIWA
@ 2004-09-20 11:12     ` skaller
  0 siblings, 0 replies; 4+ messages in thread
From: skaller @ 2004-09-20 11:12 UTC (permalink / raw)
  To: Yutaka OIWA; +Cc: caml-list

On Mon, 2004-09-20 at 16:54, Yutaka OIWA wrote:
> >> On 20 Sep 2004 10:38:33 +1000, skaller <skaller@users.sourceforge.net> said:

>  I can wait for a while for
> theoretical/practical progresses. Current problem is mainly the frontend:
> there are many language-design problems once we introduce nested bindings.
> I already had a discussion with some people in ICFP04, and I hope more.

OK, keep us posted on anything that comes out of it.
My engine supports lexical analysis, but I can't
do substring extraction. It is unclear whether to move
to supporting tagged automata, or use Alain Frisch's
parser, or both :)

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net


