From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pierre.weis@inria.fr>
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 4FD80BB91
	for <caml-list@yquem.inria.fr>; Tue, 11 Jan 2005 13:09:17 +0100 (CET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0BC9GFi010050
	for <caml-list@yquem.inria.fr>; Tue, 11 Jan 2005 13:09:16 +0100
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id NAA02758 for <caml-list@pauillac.inria.fr>; Tue, 11 Jan 2005 13:09:16 +0100 (MET)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.171])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id j0BC9FeL027647
	for <caml-list@inria.fr>; Tue, 11 Jan 2005 13:09:16 +0100
Received: from [212.227.126.160] (helo=mrelayng.kundenserver.de)
	by moutng.kundenserver.de with esmtp (Exim 3.35 #1)
	id 1CoKpm-0003HY-00; Tue, 11 Jan 2005 13:09:14 +0100
Received: from [80.129.115.55] (helo=gate.gerd-stolpmann.de)
	by mrelayng.kundenserver.de with asmtp (Exim 3.35 #1)
	id 1CoKpm-0007c2-00; Tue, 11 Jan 2005 13:09:14 +0100
Received: from ice.gerd-stolpmann.de (ice [192.168.0.13])
	by gate.gerd-stolpmann.de (Postfix) with ESMTP id D6EEFC04C;
	Tue, 11 Jan 2005 13:09:13 +0100 (CET)
Received: by ice.gerd-stolpmann.de (Postfix, from userid 1000)
	id 0F44BB031; Tue, 11 Jan 2005 13:09:00 +0100 (CET)
Subject: Re: [Caml-list] Thread safe Str
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: skaller@users.sourceforge.net
Cc: Chris King <colanderman@gmail.com>,
	"O'Caml Mailing List" <caml-list@inria.fr>
In-Reply-To: <1105430813.3534.116.camel@pelican.wigram>
References: <Pine.LNX.4.44.0501101103500.1591-100000@localhost>
	 <1105415669.3534.55.camel@pelican.wigram>
	 <875c7e0705011023033fa114ef@mail.gmail.com>
	 <1105430813.3534.116.camel@pelican.wigram>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Message-Id: <1105445337.15581.82.camel@ice.gerd-stolpmann.de>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6 
Date: Tue, 11 Jan 2005 13:08:59 +0100
X-Provags-ID: kundenserver.de abuse@kundenserver.de auth:a6865a839c0178d9aa0ce41878507ea2
X-Miltered: at concorde with ID 41E3C1EC.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at nez-perce with ID 41E3C1EC.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; caml-list:01 gerd:01 stolpmann:01 wrote:01 wrote:01 sourceforge:01 parser:01 ocamllex:01 ocamllex:01 o'caml:01 lexer:01 foo:01 foo:01 regexp:01 regexps:01 
X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr
X-Spam-Status: No, score=0.1 required=5.0 tests=FORGED_RCVD_HELO 
	autolearn=disabled version=3.0.0
X-Spam-Level: 

On Die, 2005-01-11 at 09:06, skaller wrote:
> On Tue, 2005-01-11 at 18:03, Chris King wrote:
> > On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote:
> > > If you want captures use the proper tool, namely a parser,
> > > [...]
> > > If some technology is to be integrated, please use the right technology
> > > and integrate Ocamllex.

Which is already done. There is a camlp4 extension allowing one to
define ocamllex-like scanners as part of normal modules (pa_ocamllex).
It is even part of the O'caml distribution, in camlp4/unmaintained.
Well, there is also a maintained and internationalised version called
ulex by the same author.

> > Why write a lexer and all its necessary event handlers when one can
> > just write "s/foo(.*)bar/bar\1foo/g"?  

This is a typical example where one would prefer regexp engines over
lex-type scanners. Just because of simplicity.

> Just compare a real example from the Alioth Shootout:
> 
> Specification:
> ---------------------------------------------------
> The telephone number pattern: 
> 
>       * there may be zero or one telephone numbers per line of input
>       * a telephone number may start at the beginning of the line or be
>         preceeded by a non-digit, (which may be preceeded by anything)
>       * it begins with a 3-digit area code that looks like this (DDD) or
>         DDD (where D is [0-9])
>       * the area code is followed by one space
>       * which is followed by the 3 digits of the exchange: DDD
>       * the exchange is followed by a space or hyphen [ -]
>       * which is followed by the last 4 digits: DDDD
>       * which can be followed by end of line or a non-digit (which may
>         be followed by anything).

This is already a problem near the limit of writing regexps directly, in
one step. I think the problem is not the complexity of the problem, but
the string notation of regexps which makes any solution hard to read.

However, there are better criterions when to use which lexing
technology:

- When one wants to have a token stream as result: use (ocaml)lex
- When one can profit from recursive lexer definitions: use (ocaml)lex
- When the regexp is computed: use regexp engine

> ------ FELIX SOLUTION---------------------
> 
> regexp digit = ["0123456789"];
> regexp digits3 = digit digit digit;
> regexp digits4 =  digits3 digit;
> 
> regexp area_code = digits3 | "(" digits3 ")";
> regexp exchange = digits3;
> 
> regexp phone = area_code " " exchange (" " | "-") digits4;
> 
> fun lexit (start:iterator, finish:iterator): iterator * string =>
>   reglex start to finish with
>   | phone => check_context (lexeme_start, lexeme_end)
>   | _ => ""
>   endmatch
> ;

The pa_ocamllex/ulex solution would be very similar to this.

> > Regular expressions were
> > designed for pattern matching and substitution,
> 
> So called regular expressions are NOT mathematically
> regular expressions, and they were certainly NOT designed by
> people that knew what they were doing from a theoretical viewpoint.
> 
> Regular expressions have a precise mathematical foundation,
> and they do NOT include captures.

You mean back references, e.g.

((a|b|c)*)d\1

here matching strings where the part of the string before "d" is
identical to the part after "d". The set of matched strings is a
context-sensitive language!

Capturing just for the purpose of string extraction is not problematic.
Back references are unsound when they can match the empty word.

> Attempts to add captures to the theory have been made, 
> and none are satisfactory IMHO. Certainly none actually agree 
> with any implementations.

This is true, but I think it is practically irrelevant.

> PCRE has some wording about longest matches
> and left most capture but these semantics turn out to
> be inconsistent, and are not what PCRE actually implements.

There are often several ways of capturing. In practice, I found this
never to be a real problem, because there are almost always alternate
regexps avoiding such problems.

I still think that a regexp engine is a useful addition to a language
like ocaml. Even with camlp4 integration ocamllex is an overkill
solution for many simple matching problems (notation and execution
overhead).

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------