caml-list - the Caml user's mailing list
* [Caml-list] ocamllex, regular expression syntax
@ 2003-05-22 20:56 Stefan Heimann
  2003-05-22 23:04 ` David Brown
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Stefan Heimann @ 2003-05-22 20:56 UTC (permalink / raw)
  To: caml-list

Hi,

[sorry if this posting appears twice. I first submitted it with my
news client. It seems not to appear on the mailing list and so I
decided to post it again]


I'm new to OCaml and today I played around a little with
ocamllex. Now I'm wondering why ocamllex has this strange regular
expression syntax. One has to quote every character, and an arbitrary
character is matched by the underscore...

The manual for ocamllex says: "The regular expressions are in the
style of lex, with a more Caml-like syntax."

But the regular expression syntax in the Str module looks "normal" to
me.

Regular expressions like this

"[^"\\]*(\\.[^"\\]*)*"

are not easy to read, but with the ocamllex syntax it is even more
difficult:

'"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'

(and harder to write).

Is this just for historical reasons or is there a practical reason for
this syntax? I'm just curious...



Bye,
  Stefan

-- 
Stefan Heimann
http://www.stefanheimann.net :: personal website.
http://cvsshell.sf.net       :: CvsShell, a console based cvs client.

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-22 20:56 [Caml-list] ocamllex, regular expression syntax Stefan Heimann
@ 2003-05-22 23:04 ` David Brown
  2003-05-23  8:36   ` Stefan Heimann
  2003-05-23  6:31 ` Pierre Weis
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: David Brown @ 2003-05-22 23:04 UTC (permalink / raw)
  To: caml-list

On Thu, May 22, 2003 at 10:56:33PM +0200, Stefan Heimann wrote:

> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read, but with the ocamllex syntax it is even more
> difficult:
> 
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'

But don't you find

   '"'  [^ '"' '\\' ]*  ( '\\' _ [^ '"' '\\']* )*  '"'

easier to read?  I certainly do.

Perl even has a mode where whitespace can be inserted into regexps to
make them more readable.

Dave


* Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-22 20:56 [Caml-list] ocamllex, regular expression syntax Stefan Heimann
  2003-05-22 23:04 ` David Brown
@ 2003-05-23  6:31 ` Pierre Weis
  2003-05-23  8:27   ` Stefan Heimann
  2003-05-23  8:53 ` Luc Maranget
  2003-06-02 23:42 ` John Max Skaller
  3 siblings, 1 reply; 7+ messages in thread
From: Pierre Weis @ 2003-05-23  6:31 UTC (permalink / raw)
  To: Stefan Heimann; +Cc: caml-list

Hi,

[...]
> But the regular expression syntax in the Str module looks "normal" to
> me.
> 
> Regular expressions like this
> 
> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read,

I suppose you did not try this, since it is not a legal regular
expression. I guess you mean

"[^\\"\\]*(\\.[^\\"\\]*)*"

(Hence, the ``normal looking'' of those reg-exps does not imply simple,
clear, and natural syntax !)

> but with the ocamllex syntax it is even more
> difficult:
>
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> (and harder to write).

It is not so clear to me: the ' conventions are exactly those of the
language (hence there is no need to \\ the " symbols), the _ gets its
"normal" meaning of ``whatever'' or ``catch all case'' pattern...

> Is this just for historical reasons or is there a practical reason for
> this syntax? I'm just curious...

It's just natural: you would start by giving syntax to match chars,
hence you ``naturally'' write them inside quotes following the Caml
convention. The rest of the regular expression constructs,
succession, alternative, repetition, range and catch-all, just follow
almost automatically.
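
As a small illustration of the point above, here is a sketch in plain
OCaml (not ocamllex input). It only shows that character literals and the
_ wildcard are written in an ordinary OCaml pattern match the same way
they are written in an ocamllex regular expression. The function name
classify is made up for this example.

```ocaml
(* Character literals and the wildcard in an ordinary pattern match.
   The quoting rules are those of the OCaml language itself. *)
let classify (c : char) : string =
  match c with
  | '"' -> "double quote"
  | '\\' -> "backslash"
  | _ -> "anything else"

let () =
  List.iter
    (fun c -> Printf.printf "%C is a %s\n" c (classify c))
    [ '"'; '\\'; 'x' ]
```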

Regards,

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/



* Re: Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-23  6:31 ` Pierre Weis
@ 2003-05-23  8:27   ` Stefan Heimann
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Heimann @ 2003-05-23  8:27 UTC (permalink / raw)
  To: caml-list

Hi,

On Fri, May 23, 2003 at 08:31:39AM +0200, Pierre Weis wrote:
> Hi,
>
> [...]
> > But the regular expression syntax in the Str module looks "normal" to
> > me.
> >
> > Regular expressions like this
> >
> > "[^"\\]*(\\.[^"\\]*)*"
> >
> > are not easy to read,
>
> I suppose you did not try this, since it is not a legal regular
> expression. I guess you mean
>
> "[^\\"\\]*(\\.[^\\"\\]*)*"
>
> (Hence, the ``normal looking'' of those reg-exps does not imply simple,
> clear, and natural syntax !)

No, the regular expression I mentioned is correct. The double quotes
do not enclose the regex but are part of it (maybe that's what
confused you):

stefan@kunz:~$ echo '"Hello \"World\"!"' | egrep '"[^"\\]*(\\.[^"\\]*)*"'
"Hello \"World\"!"
stefan@kunz:~$ echo 'x' | egrep '"[^"\\]*(\\.[^"\\]*)*"'
stefan@kunz:~$

The single quotes just protect the string from shell expansion...

> > but with the ocamllex syntax it is even more
> > difficult:
> >
> > '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> >
> > (and harder to write).
>
> It is not so clear to me: the ' conventions are exactly those of the
> language (hence there is no need to \\ the " symbols), the _ gets its
> "normal" meaning of ``whatever'' or ``catch all case'' pattern...

Normally you don't have to escape the " symbol with a backslash. You
only need to when you put it inside double quotes because you use the
regular expression as a normal string. But then it becomes really
messy... In OCaml's Str module syntax it looks like this:

let re = regexp "\"[^\"\\\\]*\\(\\\\.[^\"\\\\]*\\)*\""
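
To make the two escaping layers visible, here is a minimal sketch in
plain OCaml (no call into Str is made). It assumes OCaml 4.02 or later
for the quoted-string literal; the names escaped and raw exist only for
this illustration. Both bindings hold exactly the same characters; only
the notation in the source file differs.

```ocaml
(* The regexp source as an ordinary string literal: every backslash meant
   for the regexp engine is doubled, and every double quote is escaped. *)
let escaped = "\"[^\"\\\\]*\\(\\\\.[^\"\\\\]*\\)*\""

(* The same characters as a quoted-string literal, where nothing needs
   escaping at the OCaml level. *)
let raw = {|"[^"\\]*\(\\.[^"\\]*\)*"|}

let () =
  assert (escaped = raw);
  print_endline raw
```

Either value could be handed to a function that expects the regexp
source as a string.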

> [...]

Bye,
  Stefan

--
Stefan Heimann
http://www.stefanheimann.net :: personal website.
http://cvsshell.sf.net       :: CvsShell, a console based cvs client.


* Re: Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-22 23:04 ` David Brown
@ 2003-05-23  8:36   ` Stefan Heimann
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Heimann @ 2003-05-23  8:36 UTC (permalink / raw)
  To: caml-list

On Thu, May 22, 2003 at 04:04:24PM -0700, David Brown wrote:
> On Thu, May 22, 2003 at 10:56:33PM +0200, Stefan Heimann wrote:
> 
> > "[^"\\]*(\\.[^"\\]*)*"
> > 
> > are not easy to read, but with the ocamllex syntax it is even more
> > difficult:
> > 
> > '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> But don't you find
> 
>    '"'  [^ '"' '\\' ]*  ( '\\' _ [^ '"' '\\']* )*  '"'
> 
> easier to read?  I certainly do.

For me there is too much "noise" in the regular expression. The
single quotes carry no meaning of their own; they just say "this is a
character". But I know that " or x is a character, and I don't think
writing '"' or 'x' makes this fact any clearer.
 
> Perl even has a mode where whitespace can be inserted into regexps to
> make them more readable.

That's right, but you don't have to use single quotes around every
character.

Bye,
  Stefan

-- 
Stefan Heimann
http://www.stefanheimann.net :: personal website.
http://cvsshell.sf.net       :: CvsShell, a console based cvs client.


* Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-22 20:56 [Caml-list] ocamllex, regular expression syntax Stefan Heimann
  2003-05-22 23:04 ` David Brown
  2003-05-23  6:31 ` Pierre Weis
@ 2003-05-23  8:53 ` Luc Maranget
  2003-06-02 23:42 ` John Max Skaller
  3 siblings, 0 replies; 7+ messages in thread
From: Luc Maranget @ 2003-05-23  8:53 UTC (permalink / raw)
  To: Stefan Heimann; +Cc: caml-list

> I'm new to OCaml and today I played around a little with
> ocamllex. Now I'm wondering why ocamllex has this strange regular
> expression syntax. One has to quote every character, and an arbitrary
> character is matched by the underscore...
> 
> The manual for ocamllex says: "The regular expressions are in the
> style of lex, with a more Caml-like syntax."
> 
> But the regular expression syntax in the Str module looks "normal" to
> me.
> 
> Regular expressions like this
> 
> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read, but with the ocamllex syntax it is even more
> difficult:
> 
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> (and harder to write).
> 
> Is this just for historical reasons or is there a practical reason for
> this syntax? I'm just curious...
> 

Ah, regexp syntax ! I think I can explain a few principles, as I see
them.


Lex-like tools are part of, let us say, a compiler culture.
In lex-style regexps, you clearly have a two-stage definition.

1. The tokens:
     Characters (Caml-style) 'c', with some escape mechanism
     (such as '\\')
     Various operators such as *, +, etc. or delimiters such as (, )
     Spacing between tokens is irrelevant.

2. From the tokens, regexps are defined as trees.

This allows a clean, regular definition of regexp syntax. Moreover,
the lexing conventions are those of Caml.
<http://caml.inria.fr/ocaml/htmlman/manual026.html#htoc126>
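
The second stage can be sketched in plain OCaml as follows. This is only
an illustration of "regexps as trees", not how ocamllex is implemented
(ocamllex compiles to automata; the matcher below is a naive backtracking
one). The type regexp, the function matches and the value string_lit are
names invented for this sketch.

```ocaml
(* Stage 2: a regexp is a tree built from the tokens of stage 1. *)
type regexp =
  | Char of char              (* 'c' *)
  | Any                       (* _ *)
  | NoneOf of char list       (* [^ 'a' 'b'] *)
  | Seq of regexp * regexp    (* r1 r2 *)
  | Alt of regexp * regexp    (* r1 | r2 *)
  | Star of regexp            (* r * *)

(* matches r s is true when r matches the whole of s.  The continuation k
   receives the position reached after a successful partial match. *)
let matches (r : regexp) (s : string) : bool =
  let n = String.length s in
  let rec go r i k =
    match r with
    | Char c -> i < n && s.[i] = c && k (i + 1)
    | Any -> i < n && k (i + 1)
    | NoneOf cs -> i < n && not (List.mem s.[i] cs) && k (i + 1)
    | Seq (a, b) -> go a i (fun j -> go b j k)
    | Alt (a, b) -> go a i k || go b i k
    | Star a ->
      (* Either stop here, or match a once more, insisting on progress. *)
      k i || go a i (fun j -> j > i && go r j k)
  in
  go r 0 (fun i -> i = n)

(* The string-literal regexp from the start of this thread, as a tree. *)
let not_special = NoneOf [ '"'; '\\' ]

let string_lit =
  Seq (Char '"',
    Seq (Star not_special,
      Seq (Star (Seq (Char '\\', Seq (Any, Star not_special))),
        Char '"')))

let () =
  assert (matches string_lit {|"Hello \"World\"!"|});
  assert (not (matches string_lit "x"));
  print_endline "ok"
```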

But then, as you noticed, users have to type many quotes.


Perl-like tools follow a different idea: they intend to minimize
keystrokes. I guess the first idea was to make unescaped/unquoted
characters correspond to their ``most frequent usage''.
The consequence is that users type many backslashes.

In my opinion, the meaning of quotes (ocamllex) is clear because they
express one simple construct: I want this character.
The meaning of backslashes (Perl) is less clear: a backslash means ``I want
some special meaning of this character'', which covers many situations.
In particular, the ordinary meaning of \ is not ``a backslash'', and this
implies that \\ means ``I want a backslash''. The same applies to *, whose
default meaning is the repetition operator. This is a bit irregular in
my opinion.

Some additional problems arise when several meanings are considered.
Consider, for instance, \1 (reference to \(..\) number one) and \001
(character whose code is one). It is no surprise that various regexp
tools disagree on such subtle points.
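
For comparison, here is a short sketch of how OCaml's own lexical
conventions treat numeric escapes in character literals. Inside quotes a
backslash followed by three digits can only denote a character, and the
digits are read as decimal (C-style notations read such digits as
octal, which is one of the places where tools differ). The binding names
are made up for this example.

```ocaml
(* In an OCaml character literal, \ddd is a decimal character code and
   \xhh is a hexadecimal one; there is no backreference reading to
   compete with. *)
let soh = '\001'
let upper_a = '\065'
let also_upper_a = '\x41'

let () =
  assert (Char.code soh = 1);
  assert (upper_a = 'A');
  assert (also_upper_a = 'A');
  print_endline "ok"
```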

In conclusion, the lex way of doing things is driven by design
(first lex, then parse), whereas the Perl way of doing things
is driven by minimizing users' keystrokes, leading, in my opinion, to some
dark corners.


--Luc







* Re: [Caml-list] ocamllex, regular expression syntax
  2003-05-22 20:56 [Caml-list] ocamllex, regular expression syntax Stefan Heimann
                   ` (2 preceding siblings ...)
  2003-05-23  8:53 ` Luc Maranget
@ 2003-06-02 23:42 ` John Max Skaller
  3 siblings, 0 replies; 7+ messages in thread
From: John Max Skaller @ 2003-06-02 23:42 UTC (permalink / raw)
  To: Stefan Heimann; +Cc: caml-list

Stefan Heimann wrote:

> Hi,
> 
> [sorry if this posting appears twice. I first submitted it with my
> news client. It seems not to appear on the mailing list and so I
> decided to post it again]
> 
> 
> I'm new to OCaml and today I played around a little with
> ocamllex. Now I'm wondering why ocamllex has this strange regular
> expression syntax. One has to quote every character [...]
> Regular expressions like this
> 
> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read, but with the ocamllex syntax it is even more
> difficult:
> 
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> (and harder to write).
> 
> Is this just for historical reasons or is there a practical reason for
> this syntax? 


The ocamllex syntax is MUCH more readable
if you figure out how to use it correctly:

let underscore = '_'
let bindigit = ['0'-'1']
let octdigit = ['0'-'7']
let digit = ['0'-'9']
let hexdigit = digit | ['A'-'F'] | ['a'-'f']


let bin_lit  = '0' ('b' | 'B') (underscore? bindigit) +
let oct_lit  = '0' ('o' | 'O') (underscore? octdigit) +
let dec_lit  = ('0' ('d' | 'D'))? digit (underscore? digit) *
let hex_lit  = '0' ('x' | 'X') (underscore? hexdigit)  +

The reason for quoting characters is now obvious:
ocamllex provides regular *definitions*, not just
regular expressions, and they're infinitely superior;
it's much better to use identifiers for expressions
than to embed them in strings like pcre:

	"<alpha>*" // pcre
	alpha * // ocamllex
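
The benefit of naming sub-expressions can be mimicked in plain OCaml.
The following is only a sketch under stated assumptions: the combinators
(range, chr, <|>, ++, opt, star, plus, matches) are invented here, are
not part of ocamllex or of any library, and use naive backtracking
rather than automata. The point is just that once characters are quoted,
bare identifiers are free to name other regular definitions.

```ocaml
(* A matcher receives the input, a start position and a continuation that
   is called with the position reached after a successful match. *)
type re = string -> int -> (int -> bool) -> bool

let range lo hi : re =
 fun s i k -> i < String.length s && lo <= s.[i] && s.[i] <= hi && k (i + 1)

let chr c : re = range c c
let ( <|> ) (a : re) (b : re) : re = fun s i k -> a s i k || b s i k
let ( ++ ) (a : re) (b : re) : re = fun s i k -> a s i (fun j -> b s j k)
let opt (a : re) : re = fun s i k -> a s i k || k i

(* Zero or more repetitions; each repetition must consume input. *)
let rec star (a : re) : re =
 fun s i k -> k i || a s i (fun j -> j > i && star a s j k)

let plus (a : re) : re = a ++ star a
let matches (r : re) s = r s 0 (fun i -> i = String.length s)

(* Named definitions, composed like the ocamllex ones above. *)
let underscore = chr '_'
let bindigit = range '0' '1'
let digit = range '0' '9'
let hexdigit = digit <|> range 'A' 'F' <|> range 'a' 'f'
let bin_lit = chr '0' ++ (chr 'b' <|> chr 'B') ++ plus (opt underscore ++ bindigit)
let hex_lit = chr '0' ++ (chr 'x' <|> chr 'X') ++ plus (opt underscore ++ hexdigit)

let () =
  assert (matches hex_lit "0xFF_FF");
  assert (not (matches hex_lit "0x"));
  assert (matches bin_lit "0b1_0");
  print_endline "ok"
```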

You'd be mad not to write your example like this:

let dquote = '"'
let slosh = "\\"
let any = _
let nsq = [^'\\''"'] (* WEAK! *)

dquote nsq * ( slosh any nsq * ) * dquote

which I can actually read :-)

The [] syntax is weak though; Felix does much better
(and regular definitions are built into the language,
like patterns are in OCaml).

-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850


