9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
@ 2007-02-22 22:16 Folkert van Heusden
  2007-02-22 23:17 ` Alberto Cortés
  2007-02-22 23:21 ` William Josephson
  0 siblings, 2 replies; 12+ messages in thread
From: Folkert van Heusden @ 2007-02-22 22:16 UTC (permalink / raw)
  To: 9fans

Hi,

A user of a program of mine (http://www.vanheusden.com/multitail/) tries
to use plan9 regexps under linux and doesn't succeed.
Am I right that plan9 regular expressions are not compatible with the
ones of "regular" unix?


Folkert van Heusden

-- 
www.vanheusden.com/multitail - multitail is tail on steroids. multiple
               windows, filtering, coloring, anything you can think of
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-22 22:16 [9fans] regular expressions in plan9 different from the ones in unix? (at least linux) Folkert van Heusden
@ 2007-02-22 23:17 ` Alberto Cortés
  2007-02-22 23:21 ` William Josephson
  1 sibling, 0 replies; 12+ messages in thread
From: Alberto Cortés @ 2007-02-22 23:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Folkert van Heusden said:

> Hi,
> 
> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
> to use plan9 regexps under linux and doesn't succeed.
> Am I right that plan9 regular expressions are not compatible with the
> ones of "regular" unix?

They are different. I am not sure what you mean by "regular"
UNIX regexps; as far as I know, in Linux each command seems to use
a different set of regexps.

As for plan9, you can read regexp(6) at:

    http://plan9.bell-labs.com/magic/man2html/6/regexp

Sam also supports structural regexps:

    http://plan9.bell-labs.com/sources/contrib/uriel/mirror/se.pdf


-- 
Alberto Cortés



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-22 22:16 [9fans] regular expressions in plan9 different from the ones in unix? (at least linux) Folkert van Heusden
  2007-02-22 23:17 ` Alberto Cortés
@ 2007-02-22 23:21 ` William Josephson
  2007-02-22 23:48   ` Russ Cox
  1 sibling, 1 reply; 12+ messages in thread
From: William Josephson @ 2007-02-22 23:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Feb 22, 2007 at 11:16:26PM +0100, Folkert van Heusden wrote:
> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
> to use plan9 regexps under linux and doesn't succeed.
> Am I right that plan9 regular expressions are not compatible with the
> ones of "regular" unix?

Many unix programs don't use ``extended'' regular expressions by
default.  See regexp(6) on Plan 9 or try egrep/grep -E under Unix.
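
The basic/extended split can be seen directly through the POSIX
regcomp(3) interface.  This is a minimal sketch, assuming a POSIX
system with <regex.h> (not the Plan 9 library); posix_match is a
hypothetical helper name.  With no flags the pattern is compiled as a
basic regular expression, where | is an ordinary character; with
REG_EXTENDED it is alternation, as in egrep.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if pattern matches somewhere in text, 0 if it does not,
 * and -1 if the pattern fails to compile.  cflags is 0 for POSIX
 * basic syntax or REG_EXTENDED for extended (egrep-style) syntax.
 */
int
posix_match(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if(regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Calling posix_match("ab|cd", "cd", REG_EXTENDED) treats the pattern as
"ab or cd", while posix_match("ab|cd", "cd", 0) looks for the literal
five-character string ab|cd.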



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-22 23:21 ` William Josephson
@ 2007-02-22 23:48   ` Russ Cox
  2007-02-23  6:27     ` Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?) Joel Salomon
  2007-02-23 11:19     ` [9fans] regular expressions in plan9 different from the ones in unix? (at least linux) Gorka Guardiola
  0 siblings, 2 replies; 12+ messages in thread
From: Russ Cox @ 2007-02-22 23:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Many unix programs don't use ``extended'' regular expressions by
> default.  See regexp(6) on Plan 9 or try egrep/grep -E under Unix.

The Plan 9 regexp library matches the old Unix egrep command.
Any regexp you'd try under Plan 9 should work with new egreps,
though not vice versa -- new egreps tend to have newfangled
additions like [:upper:] and \w and {4,6} for repetition.
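
A rough illustration of the "not vice versa" direction, assuming a
POSIX system with <regex.h> and the default "C" locale: a pattern
using the newer additions can usually be spelled out with only the
operators listed in regexp(6).  The names ere_match, newstyle and
oldstyle are hypothetical, and the old-style spelling is an assumption
about what the older syntax accepts, not output from the Plan 9
library itself.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Uses a POSIX character class and interval repetition. */
const char *const newstyle = "^[[:upper:]]{4,6}$";
/* Same language using only ranges and ?, as in old egrep. */
const char *const oldstyle = "^[A-Z][A-Z][A-Z][A-Z][A-Z]?[A-Z]?$";

/*
 * Compile pattern as a POSIX extended regexp and report whether it
 * matches text: 1 for a match, 0 for no match, -1 on compile error.
 */
int
ere_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if(regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Both patterns accept runs of four to six upper-case ASCII letters and
nothing else, so a program that must feed the same expression to both
systems can stick to the second form.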

Russ



* Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?)
  2007-02-22 23:48   ` Russ Cox
@ 2007-02-23  6:27     ` Joel Salomon
  2007-02-23  6:54       ` William K. Josephson
  2007-02-23 17:33       ` Russ Cox
  2007-02-23 11:19     ` [9fans] regular expressions in plan9 different from the ones in unix? (at least linux) Gorka Guardiola
  1 sibling, 2 replies; 12+ messages in thread
From: Joel Salomon @ 2007-02-23  6:27 UTC (permalink / raw)
  To: 9fans

On 2/22/07, Russ Cox <rsc@swtch.com> wrote:
> The Plan 9 regexp library matches the old Unix egrep command.
> Any regexp you'd try under Plan 9 should work with new egreps,
> though not vice versa -- new egreps tend to have newfangled
> additions like [:upper:] and \w and {4,6} for repetition.

This came up as I was implementing my C lexer for the compilers class
I'm taking.  How hard would it be to allow access to regcomp(2)'s
internals, so I could build up a regexp part by part a la lex?

For example, to recognize C99 hexadecimal floating-point constants, I
wrote a second program that builds up the regexp piece by piece using
smprint(2), then compiles the whole thing:

	char	*decdig = "([0-9])",
		*hexdig = "([0-9A-Fa-f])",
		*sign = "([+\\-])",
		*dot = "(\\.)",
		*dseq, *dexp, *dfrac, *decflt,
		*hseq, *bexp, *hfrac, *hexflt;
	dseq = smprint("(%s+)", decdig);
	dexp = smprint("([Ee]%s?%s)", sign, dseq);
	dfrac = smprint("((%s?%s%s)|(%s%s))", dseq, dot, dseq, dseq, dot);
	decflt = smprint("(%s%s?)|(%s%s)", dfrac, dexp, dseq, dexp);
	regcomp(decflt);	// make sure it compiles
	print("decfloat: %s\n", decflt);
	
	hseq = smprint("(%s+)", hexdig);
	bexp = smprint("([Pp]%s?%s)", sign, dseq);
	hfrac = smprint("((%s?%s%s)|(%s%s))", hseq, dot, hseq, hseq, dot);
	hexflt = smprint("0[Xx](%s|%s)%s", hfrac, hseq, bexp);
	regcomp(hexflt);	// make sure it compiles
	print("hexfloat: %s\n", hexflt);

I know that regcomp builds up the Reprog by combining subprograms with
catenation and alternation &c., but I’d be loath to try tinkering
there directly without a much better understanding of the algorithm.
I’ve glanced through the documents at swtch.com/?????  and the regcomp
source code, but just haven’t had the time for an in-depth study.
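
For readers without a Plan 9 system, the same string-composition
approach can be sketched in portable C, with snprintf standing in for
smprint and POSIX regcomp(3) standing in for the Plan 9 library.  The
names build_hexflt and hexflt_match are hypothetical, the pattern is
anchored so it must match the whole string, and the optional f/l
suffixes are left out.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Compose an anchored extended regexp for C99 hexadecimal floating
 * constants into buf.  Returns 1 on success, 0 if buf is too small.
 */
int
build_hexflt(char *buf, size_t n)
{
	char hseq[32], dseq[32], bexp[64], hfrac[128];
	int len;

	snprintf(hseq, sizeof hseq, "(%s+)", "[0-9A-Fa-f]");
	snprintf(dseq, sizeof dseq, "(%s+)", "[0-9]");
	/* binary exponent: p or P, optional sign, decimal digits */
	snprintf(bexp, sizeof bexp, "([Pp][+-]?%s)", dseq);
	/* fraction: digits optional before the dot, or digits then dot */
	snprintf(hfrac, sizeof hfrac, "((%s?\\.%s)|(%s\\.))", hseq, hseq, hseq);
	len = snprintf(buf, n, "^0[Xx](%s|%s)%s$", hfrac, hseq, bexp);
	return len > 0 && (size_t)len < n;
}

/* Return 1 if text matches pattern, 0 if not, -1 on compile error. */
int
hexflt_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if(regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

A 512-byte buffer is ample for the composed pattern.  Note that the
binary exponent is mandatory in a hexadecimal floating constant, so
0x1.8 alone is rejected while 0x1.8p3 is accepted.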

Would such a project be a worthwhile use of time?  (Might it develop
into the asteroid to kill the dinosaur waiting for it?)

--Joel




* Re: Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?)
  2007-02-23  6:27     ` Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?) Joel Salomon
@ 2007-02-23  6:54       ` William K. Josephson
  2007-02-23 13:34         ` Joel C. Salomon
  2007-02-23 17:33       ` Russ Cox
  1 sibling, 1 reply; 12+ messages in thread
From: William K. Josephson @ 2007-02-23  6:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Feb 23, 2007 at 01:27:56AM -0500, Joel Salomon wrote:
> Would such a project be a worthwhile use of time?  (Might it develop
> into the asteroid to kill the dinosaur waiting for it?)

Why go to the trouble?  For C, the lexer is easy
enough to just write by hand.



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-22 23:48   ` Russ Cox
  2007-02-23  6:27     ` Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?) Joel Salomon
@ 2007-02-23 11:19     ` Gorka Guardiola
  2007-02-23 12:12       ` erik quanstrom
  1 sibling, 1 reply; 12+ messages in thread
From: Gorka Guardiola @ 2007-02-23 11:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Also, I am not sure you can use expressions with big Unicode
characters in Unix; the last time I looked, with sed, you could not.

On 2/23/07, Russ Cox <rsc@swtch.com> wrote:
> > Many unix programs don't use ``extended'' regular expressions by
> > default.  See regexp(6) on Plan 9 or try egrep/grep -E under Unix.
>
> The Plan 9 regexp library matches the old Unix egrep command.
> Any regexp you'd try under Plan 9 should work with new egreps,
> though not vice versa -- new egreps tend to have newfangled
> additions like [:upper:] and \w and {4,6} for repetition.
>
> Russ
>


-- 
- curiosity sKilled the cat



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-23 11:19     ` [9fans] regular expressions in plan9 different from the ones in unix? (at least linux) Gorka Guardiola
@ 2007-02-23 12:12       ` erik quanstrom
  2007-02-23 12:17         ` Gorka Guardiola
  0 siblings, 1 reply; 12+ messages in thread
From: erik quanstrom @ 2007-02-23 12:12 UTC (permalink / raw)
  To: 9fans

utf-8 encoding will "just work" unless the gnu folk are
rearranging characters with the bucky bit set, or the
result depends on knowing the width of a character,
e.g. in

a)	a character class
b)	matching a single character with ".".

for example for a file "fu" with these lines

	α0
	β0
	α1

(no leading tab) i get these results with no
local settings at all.

	; grep δ fu
	δ0

works because as far as grep is concerned, the string
i asked for (δ is U+03B4, bytes ce b4 in utf-8) is in there.
this works, too

	; egrep '(ε|δ)0' fu
	ε0
	δ0

and this works because there is a character before
"0" on the line:

	; egrep '.0' fu
	ε0
	δ0

but this doesn't

	; egrep '[αβ]0' fu
	; egrep '^.0' fu

this is for gnu grep version

	; egrep --version
	egrep (GNU grep) 2.5.1
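
the width issue can be made concrete with a small sketch in portable
c (the names utf8_runes and extra_bytes are hypothetical, and the
input is assumed to be valid utf-8).  a matcher that works a byte at
a time sees "δ0" as three bytes, so a "." that consumes one byte
cannot cover the whole of δ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a UTF-8 string by skipping continuation
 * bytes, which all have the form 10xxxxxx.  Assumes valid UTF-8.
 */
size_t
utf8_runes(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* Bytes a byte-oriented matcher sees beyond one per character. */
size_t
extra_bytes(const char *s)
{
	return strlen(s) - utf8_runes(s);
}
```

for the string "\xce\xb4" "0" (δ followed by 0), utf8_runes reports
two characters and extra_bytes reports one, which is the byte a
single-byte "." leaves unmatched in front of the "0".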


- erik



* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-23 12:12       ` erik quanstrom
@ 2007-02-23 12:17         ` Gorka Guardiola
  2007-02-23 13:02           ` erik quanstrom
  0 siblings, 1 reply; 12+ messages in thread
From: Gorka Guardiola @ 2007-02-23 12:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

If it doesn't work in one case, then it doesn't work.

On 2/23/07, erik quanstrom <quanstro@coraid.com> wrote:
>         ; egrep '[αβ]0' fu
>         ; egrep '^.0' fu
>


-- 
- curiosity sKilled the cat


* Re: [9fans] regular expressions in plan9 different from the ones in unix? (at least linux)
  2007-02-23 12:17         ` Gorka Guardiola
@ 2007-02-23 13:02           ` erik quanstrom
  0 siblings, 0 replies; 12+ messages in thread
From: erik quanstrom @ 2007-02-23 13:02 UTC (permalink / raw)
  To: 9fans

i don't think that sort of absolutist thinking really works.
i used gnu grep (and all the other gnu tools) on utf-8 stuff 
from the time of the first sam release for unix till i stopped using 
linux for much development.  i never had a problem with
g(ed|sed|awk|e?grep) tripping on utf-8 when the locale was
unset or "C".  i did keep in mind that . wasn't going to match
"☺", though.

we all know the limitations of our tools.  that doesn't make
them broken.  

just because plan 9 does bad things if you exceed NPROCS,
doesn't make it broken.

- erik

On 2/23/07, erik quanstrom <quanstro@coraid.com> wrote:
>         ; egrep '[αβ]0' fu
>         ; egrep '^.0' fu
>



* Re: Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?)
  2007-02-23  6:54       ` William K. Josephson
@ 2007-02-23 13:34         ` Joel C. Salomon
  0 siblings, 0 replies; 12+ messages in thread
From: Joel C. Salomon @ 2007-02-23 13:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 2/23/07, William K. Josephson <jkw@eecs.harvard.edu> wrote:
> On Fri, Feb 23, 2007 at 01:27:56AM -0500, Joel Salomon wrote:
> > Would such a project be a worthwhile use of time?  (Might it develop
> > into the asteroid to kill the dinosaur waiting for it?)
>
> Why go to the trouble?  For C, the lexer is easy
> enough to just write by hand.

For a useful and significant subset of C, the lexer is easy enough to
just write by hand.  I was trying for full C99 (what were those ISO
guys drinking?).  I spent far too much time on it to call the task
"easy".

I have what I believe is a pretty complete C lexer
(http://www.tip9ug.jp/who/chesky/comp/lex.c).  It still is far from
being integrated into a full grammar, but it scans cpp(1) output
nicely.  I tested it against some of the odder "features" of C99—UCNs,
hex floats, &c.—and it seems to work.

Some parts were easy, some less so, and some looked easy until they
turned out to be subtly wrong.  Recognizing whether the number seen is
an integer (in decimal, octal, or hex) or a real number was one of the
hard parts, and one I gladly handed off to a regexp.  The way I
generated the regexp may not be ideal, as someone pointed out to me
off-list, but hand-generated code that recognizes what sort of number
was seen would be exactly equivalent to the regexp, and less readable.
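
For comparison, a hand-written classifier of the kind discussed here
might look roughly like the following.  This is a simplified sketch
and not the code from lex.c: the names classify, exponent and the
NUM_ constants are hypothetical, and integer and floating suffixes
(u, l, f) are deliberately not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum numkind { NUM_BAD, NUM_DEC, NUM_OCT, NUM_HEX, NUM_DECFLT, NUM_HEXFLT };

/* Skip an optional sign and one or more digits; NULL if no digit. */
static const char *
exponent(const char *s)
{
	if(*s == '+' || *s == '-')
		s++;
	if(!isdigit((unsigned char)*s))
		return NULL;
	while(isdigit((unsigned char)*s))
		s++;
	return s;
}

/* Classify a complete numeric constant (no suffixes). */
enum numkind
classify(const char *s)
{
	const char *start = s;
	int ndig = 0, dot = 0, hex = 0, octal_ok = 1;
	unsigned char c;

	assert(s != NULL);
	if(s[0] == '0' && (s[1] == 'x' || s[1] == 'X')){
		hex = 1;
		s += 2;
	}
	for(;; s++){
		c = (unsigned char)*s;
		if(hex ? isxdigit(c) : isdigit(c)){
			ndig++;
			if(c == '8' || c == '9')
				octal_ok = 0;
		}else if(c == '.' && !dot)
			dot = 1;
		else
			break;
	}
	if(ndig == 0)
		return NUM_BAD;
	if(hex){
		/* a hex constant with a dot needs a binary exponent */
		if(*s == 'p' || *s == 'P'){
			s = exponent(s + 1);
			return (s != NULL && *s == '\0') ? NUM_HEXFLT : NUM_BAD;
		}
		return (!dot && *s == '\0') ? NUM_HEX : NUM_BAD;
	}
	if(*s == 'e' || *s == 'E'){
		s = exponent(s + 1);
		return (s != NULL && *s == '\0') ? NUM_DECFLT : NUM_BAD;
	}
	if(*s != '\0')
		return NUM_BAD;
	if(dot)
		return NUM_DECFLT;
	if(start[0] == '0')	/* a lone 0 is an octal constant in C */
		return octal_ok ? NUM_OCT : NUM_BAD;
	return NUM_DEC;
}
```

One of the subtle cases: in 0x1e3 the e is a hex digit, not an
exponent marker, so the constant is an integer, whereas 1e3 is a
floating constant.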

--Joel


* Re: Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?)
  2007-02-23  6:27     ` Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?) Joel Salomon
  2007-02-23  6:54       ` William K. Josephson
@ 2007-02-23 17:33       ` Russ Cox
  1 sibling, 0 replies; 12+ messages in thread
From: Russ Cox @ 2007-02-23 17:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Lex has three benefits:

 1) You don't have to write the lexer directly.
 2) What you do have to write is fairly concise.
 3) The resulting lexer is fairly efficient.

It has two main drawbacks:

 4) The input model does not always match your
 own program's input model, creating a messy interface.
 5) Once you need more than regular expressions,
 lexers written with state variables and such can get
 very opaque very fast.

Many on this list would argue that (1) and (2) do not
outweigh (4) and (5), instead suggesting that writing a
lexer by hand is not too difficult and ends up being
more maintainable than a lex spec in the long run.
And of course, for a well-written by-hand lexer,
you get to keep (3).
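
As a rough indication of what "by hand" means, here is a minimal
sketch of a scanner loop in portable C.  It is not taken from any
Plan 9 compiler; the names next_token and the TOK_ constants are
hypothetical, and it only distinguishes identifiers, decimal integers
and single-character punctuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum tok { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

/*
 * Scan one token starting at *pp, copy its text into buf (of size n,
 * truncating if necessary), and advance *pp past the token.
 */
enum tok
next_token(const char **pp, char *buf, size_t n)
{
	const char *p = *pp;
	size_t len = 0;
	enum tok kind;

	assert(n > 0);
	while(isspace((unsigned char)*p))
		p++;
	if(*p == '\0')
		kind = TOK_EOF;
	else if(isalpha((unsigned char)*p) || *p == '_'){
		kind = TOK_IDENT;
		while(isalnum((unsigned char)*p) || *p == '_'){
			if(len < n - 1)
				buf[len++] = *p;
			p++;
		}
	}else if(isdigit((unsigned char)*p)){
		kind = TOK_NUMBER;
		while(isdigit((unsigned char)*p)){
			if(len < n - 1)
				buf[len++] = *p;
			p++;
		}
	}else{
		kind = TOK_PUNCT;
		if(len < n - 1)
			buf[len++] = *p;
		p++;
	}
	buf[len] = '\0';
	*pp = p;
	return kind;
}
```

The caller owns the input pointer, so the input model is whatever the
surrounding program wants it to be, which is the point of (4).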

Creating new entry hooks in the regexp library doesn't
preserve (1), (2), or (3).  And if much of your time is
spent in lexical analysis (as Ken claimed was true for
the Plan 9 compilers), losing (3) is a big deal.
So that seems like not a very good replacement for lex.

All that said, lex has been used to write a lot of C
compilers, and can be used in that context without
running into much of (4) or (5).  Why not just use lex here?

Russ

