caml-list - the Caml user's mailing list
From: Luc Maranget <luc.maranget@inria.fr>
To: lists@stefanheimann.net (Stefan Heimann)
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] ocamllex, regular expression syntax
Date: Fri, 23 May 2003 10:53:15 +0200 (MET DST)
Message-ID: <200305230853.KAA0000027204@beaune.inria.fr>
In-Reply-To: <20030522205632.GA2130@kunz.ratzer> from "Stefan Heimann" at May 22, 2003 10:56:33

> I am new to ocaml and today I played around a little bit with
> ocamllex. Now I'm wondering why ocamllex has this strange regular
> expression syntax. One has to quote every character, an arbitrary
> character is matched by the underscore...
> 
> The manual for ocamllex says: "The regular expressions are in the
> style of lex, with a more Caml-like syntax."
> 
> But the regular expression syntax in the Str module looks "normal" to
> me.
> 
> Regular expressions like this
> 
> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read, but with the ocamllex syntax it is even more
> difficult:
> 
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> (and harder to write).
> 
> Is this just for historical reason or is there a practical reason for
> this syntax? I'm just curious...
> 

Ah, regexp syntax! I think I can explain a few principles, as I see
them.


Lex-like tools are part of, let us say, a compiler culture.
In lex-style regexps, you clearly have a two-stage definition.

1. The tokens:
     Characters (Caml-style) 'c', with some escape mechanism
     (such as '\\')
     Various operators such as *, +, etc. or delimiters such as (, )
     Spacing between tokens is irrelevant.

2. From the tokens, regexps are defined as trees.

This allows a clean, regular definition of regexp syntax. Moreover,
the lexing conventions are those of Caml.
<http://caml.inria.fr/ocaml/htmlman/manual026.html#htoc126>

But then, as you noticed, users have to type many quotes.
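
To make the two stages concrete, here is a minimal sketch in OCaml. It
models a regexp as a tree and prints it back in ocamllex-style concrete
syntax. The type `regexp` and the functions `char_lit` and `to_string`
are hypothetical names chosen for illustration; they are not ocamllex's
actual internals, and only the few constructs needed for the example are
covered.

```ocaml
(* Stage 2: a regexp is a tree. Hypothetical type, for illustration only. *)
type regexp =
  | Char of char          (* a quoted character: 'c' *)
  | Any                   (* the underscore: any character *)
  | NoneOf of char list   (* a negated character set: [^ 'a' 'b'] *)
  | Seq of regexp list    (* concatenation: r1 r2 ... *)
  | Star of regexp        (* repetition: r* *)

(* Stage 1 in reverse: print a character as a Caml character literal,
   using the standard escaping rules (e.g. backslash is written '\\'). *)
let char_lit c = "'" ^ Char.escaped c ^ "'"

(* Print the tree in ocamllex-style syntax. Parentheses are added only
   around a starred sequence, which is enough for this example. *)
let rec to_string = function
  | Char c -> char_lit c
  | Any -> "_"
  | NoneOf cs -> "[^" ^ String.concat "" (List.map char_lit cs) ^ "]"
  | Seq rs -> String.concat "" (List.map to_string rs)
  | Star (Seq _ as r) -> "(" ^ to_string r ^ ")*"
  | Star r -> to_string r ^ "*"

(* The string-literal regexp from the quoted message, built as a tree. *)
let body = NoneOf ['"'; '\\']

let string_literal =
  Seq [ Char '"';
        Star body;
        Star (Seq [ Char '\\'; Any; Star body ]);
        Char '"' ]

let () = print_endline (to_string string_literal)
```

Running this prints the ocamllex form given in the quoted message,
quotes included. Every quote in the output comes from a single rule
(`char_lit`), which is the regularity discussed here.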


Perl-like tools follow a different idea: they aim to minimize
keystrokes. I guess the first idea was to make unescaped/unquoted
characters correspond to their ``most frequent usage''.
The consequence is that users type many backslashes.

In my opinion, the meaning of quotes (ocamllex) is clear because they
express one simple construct: I want this character.
The meaning of backslashes (perl) is less clear: it means ``I want some
special meaning of this character'', which covers many situations.
In particular, the ordinary meaning of \ is not ``a backslash'', and this
implies that \\ means ``I want a backslash''. The same applies to *, whose
default meaning is the repetition operator, so that \* means ``I want a
star''. This is a bit irregular in my opinion.

Some additional problems arise when several meanings are considered.
Consider, for instance, \1 (reference to \(..\) number one) and \001
(character whose code is one). It is no surprise that various regexp
tools disagree on such subtle points.
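
As a point of comparison, here is a small sketch of how the same escape
reads under Caml lexical conventions, where '\ddd' inside a character
literal denotes the character with decimal code ddd. The names
`code_of_escape` and `letter_a` are illustrative only.

```ocaml
(* In a Caml character literal, '\ddd' is the character whose DECIMAL
   code is ddd. The quotes delimit the token, so there is no competing
   reading of the backslash to resolve. *)
let code_of_escape = Char.code '\001'   (* the character with code 1 *)
let letter_a = '\065'                   (* decimal 65, i.e. 'A' *)

let () = Printf.printf "%d %c\n" code_of_escape letter_a
```

This prints `1 A`. Since the escape lives inside the quoted character
token, its meaning is fixed by the lexer before any regexp structure is
considered.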

In conclusion, the lex way of doing things is inspired by design
(first lex, then parse), whereas the perl way of doing things
is inspired by minimizing users' keystrokes, leading, in my opinion, to some
dark corners.


--Luc







