Date: Sun, 26 Oct 2008 00:35:44 +0200
From: "Rudolf Sykora"
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@9fans.net>
Subject: Re: [9fans] non greedy regular expressions
In-Reply-To: <2e4a50a0810241652r38d2aa1ft2b6fb9104d2988ae@mail.gmail.com>
References: <20081024170237.68ED28DE7@okapi.maths.tcd.ie> <6520c845566013ada472281bf9c0da73@coraid.com> <2e4a50a0810241652r38d2aa1ft2b6fb9104d2988ae@mail.gmail.com>

2008/10/25 Tom Simons:
> Is awk available?  This worked for me, but it's not on Plan9.  It does copy
> the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
> should be deleted).
> $ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data

About that newline: it should indeed be copied, since it really does lie
between the delimiters. However, it is also the only one that should be
copied. There shouldn't be any blank line anywhere in the middle of the
resulting output. In this sense your solution doesn't work.

Your solution ALMOST works in Linux. It turns out not to work in Plan 9 at
all, probably because in Plan 9 only the first character of the RS variable
is treated as the record delimiter.

But what I really wanted to see is how people using Plan 9 can solve the
problem without using a specialized minilanguage like awk.

See what Eric S. Raymond says in his Art of Unix Programming:
http://www.faqs.org/docs/artu/ch08s02.html#awk
Basically he claims that the way this language was designed was unfortunate,
and that the language is in decline.
Among the reasons is that languages like Perl, Python, and Ruby all form a
suitable superset, and that 'Programmers increasingly chose to do awklike
things with Perl or (later) Python, rather than keep two different scripting
languages in their heads'.

I myself may not be competent enough to claim this too, but at least from my
own experience I have come to like using as few tools as possible. Thus I
don't want to use awk any longer. I don't like Perl either (in my opinion
it's a bad language). Python is nice for coding, but somehow not handy for
command-line use. Ruby seems to be superior to all. So in my ideal (though
not the quickest) world I'd rather get rid of Perl and awk and use Ruby
instead, if anything more complicated is needed.

Anyway, my main reason for the task was to see if someone can really come up
with a nice solution using exclusively sam (and, moreover, possibly without
the 's' command). By the way, no one so far has answered the question of
whether submatch tracking can be used with commands other than 's'; remember
that 's' was declared unnecessary. I wanted to see the thing done in
sam/acme without the use of awk or sed.

Here is Charles Forsyth's solution, which really works:

---
1. delete everything not between delimiters
	,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimiters
	,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input
---

I like it. It does exactly what I wanted.

And here comes the point I've been after from the very beginning. I wanted
to show that the solution has a very ugly part in it, namely
	([^C]|C[^B]|CB[^A]|\n)+
whose only purpose is to ensure that there is no CBA somewhere in the middle.
Imagine the delimiter were something more complicated. Imagine I'd like the
closing delimiter to be either CBA or EFG (either one would do). I think
you'd soon go mad.
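To see what that ugly part costs, here is a minimal sketch that mirrors the
two sam steps with Python's re module. It is only an illustration: extract,
KEEP and DELIMS are names made up for this sketch, the sample strings are
invented, and Python matches leftmost-first rather than leftmost-longest, so
it says nothing definitive about sam itself.

```python
import re

# Step 1 pattern, copied from the sam command above. In Python [^C] already
# matches a newline, so the |\n alternative is redundant here but harmless.
KEEP = re.compile(r"ABC([^C]|C[^B]|CB[^A]|\n)+CBA")
# Step 2 pattern: the delimiters themselves.
DELIMS = re.compile(r"ABC|CBA")

def extract(text):
    # Step 1: keep only the delimited stretches (sam's ,y/.../d deletes the rest).
    kept = "".join(m.group(0) for m in KEEP.finditer(text))
    # Step 2: delete the delimiters (sam's ,x/ABC|CBA/d).
    return DELIMS.sub("", kept)

# Two delimited stretches, the second one spanning a newline.
print(repr(extract("x ABC one CBA y ABC two\nlines CBA z\n")))

# A body that ends in C right before the closing CBA.
print(repr(extract("ABCxCCBA")))
```

The second call prints an empty string in this sketch: the body "xC" ends in
a lone C, which none of the alternatives can consume, so nothing is matched.
That is the kind of boundary case step 3 asks about.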
In Python (http://www.amk.ca/python/howto/regex/), this is easily solved
with a non-greedy operator:
	/ABC(.*?)CBA/
	/ABC(.*?)(CBA|EFG)/

It's true that non-greedy operators don't have a good meaning in Plan 9 (as
R. Cox explained), due to its leftmost-longest paradigm. However, what I
conclude from this example is that the leftmost-first way of thinking, with
two kinds of operators (greedy/non-greedy), can sometimes be quite useful.

Now, if the leftmost-longest match is usable for my problem, I am fine with
C + regexp(6). If not, I only see the possibility of using Perl/Python
nowadays (if I don't want to go mad as above).

Put the other way round: Perl/Python can hopefully almost always be used;
they solve almost any problem with regexps to our liking. Then universality
wins and we may end up using Perl/Python exclusively. And we will (people
do) use them in spite of their wrong (i.e. slow, perhaps horrible, as some
of you said) designs.

My question then is: wouldn't it be better to switch to the leftmost-first
paradigm, and so open up the possible use of (non-)greedy operators and, in
a way, move toward agreement with Perl/Python syntax? And use a good
algorithm for all of that?

But maybe it's not worth it and the current state is just sufficient...

Ruda
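
P.S. For concreteness, a minimal sketch of the non-greedy variant using
Python's re module. The helper name between and the sample text are made up
for illustration; re.DOTALL is there so that . also crosses newlines, since
the delimited text may span several lines.

```python
import re

def between(text, opener, closers):
    """Return every stretch found between opener and the nearest closer."""
    closing = "|".join(re.escape(c) for c in closers)
    # (.*?) is the non-greedy part: it stops at the first closer it can reach.
    pattern = re.compile(re.escape(opener) + "(.*?)(?:" + closing + ")", re.DOTALL)
    return pattern.findall(text)

sample = "x ABC one CBA y ABC two\nlines EFG z ABC three CBA\n"

# Only CBA closes: the second stretch runs on past EFG to the next CBA.
print(between(sample, "ABC", ["CBA"]))
# prints [' one ', ' two\nlines EFG z ABC three ']

# Either CBA or EFG closes: each stretch ends at whichever comes first.
print(between(sample, "ABC", ["CBA", "EFG"]))
# prints [' one ', ' two\nlines ', ' three ']
```

Adding a second closing delimiter is just one more entry in the list; no
exclusion pattern has to be written by hand.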