From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <a560a5d00810251535i7c27be62g2455acd6c9888b2a@mail.gmail.com>
Date: Sun, 26 Oct 2008 00:35:44 +0200
From: "Rudolf Sykora" <rudolf.sykora@gmail.com>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@9fans.net>
In-Reply-To: <2e4a50a0810241652r38d2aa1ft2b6fb9104d2988ae@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <a560a5d00810240108q62854ec1w16614c90c7071436@mail.gmail.com>
	<20081024170237.68ED28DE7@okapi.maths.tcd.ie>
	<a560a5d00810241041w798aa12clb4e425eb74360408@mail.gmail.com>
	<dd6fe68a0810241101s6d8a9e57n80e28550ca80255@mail.gmail.com>
	<a560a5d00810241256q6ca59f46n4f0d70cf63475f51@mail.gmail.com>
	<dd6fe68a0810241410pe6593dcq79c7f4a139ee598e@mail.gmail.com>
	<a560a5d00810241440j467c6cccm617b5c1dec180aa3@mail.gmail.com>
	<6520c845566013ada472281bf9c0da73@coraid.com>
	<a560a5d00810241504m442ba788u8f9a34218d18576f@mail.gmail.com>
	<2e4a50a0810241652r38d2aa1ft2b6fb9104d2988ae@mail.gmail.com>
Subject: Re: [9fans] non greedy regular expressions
Topicbox-Message-UUID: 2729f7d0-ead4-11e9-9d60-3106f5b1d025

2008/10/25 Tom Simons <tom.simons@gmail.com>:
> Is awk available?  This worked for me, but it's not on Plan9.  It does copy
> the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
> should be deleted).
> $ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data

About that newline: it should be copied, as you describe, since it
really is between the delimiters. However, it is also the only one
that should be copied. There shouldn't be any blank line anywhere
in the middle of the resulting output. In this sense your solution
doesn't work.

Your solution ALMOST works in linux. It doesn't work in plan9 at
all, probably because in plan9 only the 1st character of the RS
variable is used as the record delimiter.

But what I really wanted to see is how people using plan9 can solve
the problem without using a specialized minilanguage like awk. See
what Eric S. Raymond says in his Art of Unix Programming:

http://www.faqs.org/docs/artu/ch08s02.html#awk

Basically he claims that the design of this language was unfortunate
and that the language is in decline. Among the reasons is that
languages like Perl, Python, Ruby all form a suitable superset and
that
'Programmers increasingly chose to do awklike things with Perl or
(later) Python, rather than keep two different scripting languages in
their heads'.

I may not be competent enough to claim the same, but at least from my
own experience I have come to prefer using as few tools as possible.
Thus I don't want to use awk any longer. I don't like perl either (in
my opinion it's a bad language). Python is nice for coding, but
somehow not handy for commandline use. Ruby seems to be superior to
all. So in my ideal (though not the quickest) world I'd rather get
rid of perl and awk, and use ruby instead, if anything more
complicated is needed.

Anyway, my main reason for the task was to see if someone can really
come up with a nice solution using exclusively sam (and moreover,
possibly without the 's' command; btw. no one so far has answered the
question of whether submatch tracking can be used with commands other
than 's'; remember 's' was designated unnecessary).

I wanted to see the thing done in sam/acme without the use of awk or
sed. That is what Charles Forsyth's solution does, and it really works:
---
1. delete everything not between delimiters
       ,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimiters
       ,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input
---
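
For readers without sam at hand, the two steps above can be sketched
in Python. This is only a rough emulation under stated assumptions:
the sample text is made up (the a.data file from the thread is not
shown), and Python's re is a backtracking, leftmost-first engine, so
it is not guaranteed to agree with sam on every boundary case.

```python
import re

# Hypothetical sample input; the thread's a.data is not shown.
text = "junk\nABCfirst\nCBA\nmore junk\nABCsecond\nCBA\ntail\n"

# Same shape as the sam pattern: the body is built so that it cannot
# step over a "CBA". In Python [^C] already matches "\n"; the explicit
# \n alternative is kept only to mirror the sam expression.
block = re.compile(r"ABC(?:[^C]|C[^B]|CB[^A]|\n)+CBA")

# Step 1: sam's ",y/re/d" deletes the text between matches,
# so only the matched blocks remain.
kept = "".join(m.group(0) for m in block.finditer(text))

# Step 2: sam's ",x/ABC|CBA/d" deletes every delimiter.
result = re.sub(r"ABC|CBA", "", kept)

print(repr(result))
```

With this sample only the two delimited bodies survive, each with the
newline that sat between its delimiters.
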
I like it. It does exactly what I wanted. And here comes the point
I've been after from the very beginning. I wanted to show that the
solution has a very ugly part in it, namely
([^C]|C[^B]|CB[^A]|\n)+
whose only purpose is to ensure there is no CBA somewhere in the
middle. Imagine there were something more complicated as a
delimiter. Imagine I'd like the closing delimiter to be either CBA or
EFG (either one would do). I think you'd soon go mad.

In python (http://www.amk.ca/python/howto/regex/), this is easily
solved with a non-greedy operator:

/ABC(.*?)CBA/
/ABC(.*?)(CBA|EFG)/

It's true that non-greedy operators don't have a good meaning in Plan9
(as R. Cox explained), due to its leftmost-longest paradigm. However,
what I conclude from this example is that the leftmost-first kind of
thinking with two kinds of ops (greedy/nongreedy) can sometimes be
quite useful.
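
The difference between the two paradigms already shows with a plain
alternation. The sketch below uses Python's re for the leftmost-first
side. The helper leftmost_longest is hypothetical, written only for
this illustration; it handles just a flat list of alternatives and is
not how regexp(6) is implemented.

```python
import re

text = "xabcd"

# Leftmost-first (Python, Perl): alternatives are tried in order and
# the first one that succeeds wins, even if a later one is longer.
first = re.search(r"ab|abcd", text).group(0)

def leftmost_longest(alternatives, s):
    """Rough emulation of leftmost-longest for a list of alternatives:
    at the earliest start position where anything matches,
    return the longest match."""
    for start in range(len(s)):
        hits = []
        for alt in alternatives:
            m = re.compile(alt).match(s, start)
            if m:
                hits.append(m.group(0))
        if hits:
            return max(hits, key=len)
    return None

longest = leftmost_longest(["ab", "abcd"], text)

print(first, longest)
```

Here the leftmost-first engine returns "ab" while the leftmost-longest
emulation returns "abcd", although both start at the same position.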

Now. If the leftmost-longest match is usable for my problem, I am fine
with C + regexp(6). If not, the only possibility I see nowadays is to
use perl/python (if I don't want to go mad as above). Put the other
way round: perl/python can hopefully almost always be used; they
solve almost any problem with regexps to our liking. Then
universality wins and we may end up using perl/python exclusively. And
we will (people do) use them in spite of their wrong (i.e. slow;
perhaps horrible, as some of you said) designs.

My question then is: wouldn't it be better to switch to the
leftmost-first paradigm, hence open up the possible use of
(non-)greedy operators, and in a way contribute to an accord with
perl/python syntax?  And use a good algorithm for all that? But maybe
it's not worth it and the current state is just sufficient...

Ruda